History log of /freebsd-10.0-release/sys/ufs/
Revision Date Author Comments
259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


255219 05-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides room for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain the total
number of elements in the array minus 2. This means that if those two bits are
equal to 0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain the array index. Only one bit
is set, and its position within this five-bit range defines the array index.
This means there can be at most five array elements in the future.

To define a new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
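
The bit layout described above can be sketched roughly as follows. This is an
illustration only, not the kernel header: the DEMO_ names, the shift constant
57 and the helper functions are assumptions derived from the description (two
size bits, five index bits, leaving 57 right bits per array element).

```c
#include <stdint.h>

/*
 * Assumed layout of one 64-bit array element:
 *   bits 63-62  number of array elements minus 2 (first element only)
 *   bits 61-57  one-hot array index
 *   bits 56-0   the rights themselves
 */
#define DEMO_INDEX_SHIFT 57
#define DEMO_CAPRIGHT(idx, bit) ((1ULL << (DEMO_INDEX_SHIFT + (idx))) | (bit))

/* Hypothetical right stored in array element 1. */
#define DEMO_CAP_PDKILL DEMO_CAPRIGHT(1, 0x0000000000000800ULL)

/* Decode the one-hot index field; -1 if zero or several bits are set. */
static int demo_array_index(uint64_t right)
{
    unsigned int field = (unsigned int)((right >> DEMO_INDEX_SHIFT) & 0x1f);
    int idx;

    for (idx = 0; idx < 5; idx++)
        if (field == (1u << idx))
            return idx;
    return -1;
}

/* Number of array elements announced by the top two bits of element 0. */
static int demo_element_count(uint64_t first_element)
{
    return (int)(first_element >> 62) + 2;
}

/* How many distinct rights fit into a given number of elements. */
static int demo_rights_capacity(int nelems)
{
    return nelems * DEMO_INDEX_SHIFT;
}
```

With 57 usable bits per element, two elements give the 114 rights mentioned
above and five elements give 285.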

We still support aliases that combine a few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is a new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights are passed to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions as a comma-separated
list, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.
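
A minimal sketch of the two mechanisms just described: a macro that appends
the 0ULL terminator, and a check that rejects a value mixing rights from
different array elements. The demo_ names and the two-element structure are
hypothetical, and a boolean return value stands in for the assertion used by
the real functions.

```c
#include <stdarg.h>
#include <stdbool.h>

#define DEMO_CAPRIGHT(idx, bit) ((1ULL << (57 + (idx))) | (bit))
#define DEMO_CAP_READ   DEMO_CAPRIGHT(0, 0x0000000000000001ULL)
#define DEMO_CAP_LOOKUP DEMO_CAPRIGHT(0, 0x0000000000000400ULL)
#define DEMO_CAP_PDKILL DEMO_CAPRIGHT(1, 0x0000000000000800ULL)

struct demo_rights {
    unsigned long long r[2];
};

/*
 * Walk the arguments until the 0ULL sentinel supplied by the macro below.
 * Returns false as soon as a value does not carry exactly one of the two
 * index bits this two-element sketch understands.
 */
static bool demo_rights_set_impl(struct demo_rights *rights, ...)
{
    va_list ap;
    unsigned long long right, field;
    bool ok = true;

    va_start(ap, rights);
    while ((right = va_arg(ap, unsigned long long)) != 0) {
        field = (right >> 57) & 0x1f;
        if (field != 1 && field != 2) {
            ok = false;    /* mixed or unknown array elements */
            break;
        }
        rights->r[field - 1] |= right;
    }
    va_end(ap);
    return ok;
}

/* Callers never write the terminator; the macro adds it. */
#define demo_rights_set(rights, ...) \
    demo_rights_set_impl((rights), __VA_ARGS__, 0ULL)
```

Calling demo_rights_set(&rs, DEMO_CAP_READ, DEMO_CAP_PDKILL) succeeds because
each argument names a single element, while passing
DEMO_CAP_LOOKUP | DEMO_CAP_PDKILL as one argument sets two index bits and is
rejected.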

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


254996 28-Aug-2013 mckusick

While looking at block layouts as part of fixing filesystem block
allocations under low free-space conditions (-r254995), determined
that the old block-preference search order used before -r249782 worked
a bit better. This change reverts to that block-preference search order.

MFC after: 2 weeks


254995 28-Aug-2013 mckusick

A performance problem was reported in PR kern/181226:

I have 25TB Dell PERC 6 RAID5 array. When it becomes almost
full (10-20GB free), processes which write data to it start
eating 100% CPU and write speed drops below 1MB/sec (normally
it gives 400MB/sec). The revision at which it first became
apparent was http://svnweb.freebsd.org/changeset/base/249782.

The offending change reserved an area in each cylinder group to
store metadata. The new algorithm attempts to save this area for
metadata and allows its use for non-metadata only after all the
data areas have been exhausted. The size of the reserved area
defaults to half of minfree, so the filesystem reports full before
the data area can completely fill. However, in this report, the
filesystem has had minfree reduced to 1% thus forcing the metadata
area to be used for data. As the filesystem approached full, it
had only metadata areas left to allocate. The result was that
every block allocation had to scan summary data for 30,000 cylinder
groups before falling back to searching up to 30,000 metadata areas.

The fix is to give up on saving the metadata areas once the free
space reserve drops below 2%. The effect of this change is to use
the old algorithm of just accepting the first available block that
we find. Since most filesystems use the default 5% minfree, this
will have no effect on their operation. For those that want to push
to the limit, they will get their crappy block placements quickly.
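
The policy switch can be pictured with the following sketch. It is a
simplified model, not the ffs_alloc.c code: struct demo_cg, demo_pick_cg()
and the way the threshold is passed in are all hypothetical.

```c
#include <stddef.h>

struct demo_cg {
    int data_free;    /* free blocks outside the metadata area */
    int meta_free;    /* free blocks inside the metadata area */
};

/* Return the index of the cylinder group to allocate from, or -1. */
static int demo_pick_cg(const struct demo_cg *cgs, size_t ncg,
    int minfree_percent)
{
    size_t i;

    if (minfree_percent < 2) {
        /* Old algorithm: accept the first group with any free block. */
        for (i = 0; i < ncg; i++)
            if (cgs[i].data_free > 0 || cgs[i].meta_free > 0)
                return (int)i;
        return -1;
    }
    /* Otherwise exhaust every data area before using metadata areas. */
    for (i = 0; i < ncg; i++)
        if (cgs[i].data_free > 0)
            return (int)i;
    for (i = 0; i < ncg; i++)
        if (cgs[i].meta_free > 0)
            return (int)i;
    return -1;
}
```

With a low reserve the first loop returns immediately, which avoids the
double scan over all cylinder groups described in the report.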

Submitted by: Dmitry Sivachenko
Fix Tested by: Dmitry Sivachenko
PR: kern/181226
MFC after: 2 weeks


254986 28-Aug-2013 ivoras

Take a very small step toward the Century of the Anchovy by increasing the
time dirhash entries stay in memory before being considered for eviction to
1 minute.


254627 21-Aug-2013 ken

Expand the use of stat(2) flags to allow storing some Windows/DOS
and CIFS file attributes as BSD stat(2) flags.

This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.

The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.

The summary of the flags is as follows:

UF_SYSTEM: Command line name: "system" or "usystem"
ZFS name: XAT_SYSTEM, ZFS_SYSTEM
Windows: FILE_ATTRIBUTE_SYSTEM

This flag means that the file is used by the
operating system. FreeBSD does not enforce any
special handling when this flag is set.

UF_SPARSE: Command line name: "sparse" or "usparse"
ZFS name: XAT_SPARSE, ZFS_SPARSE
Windows: FILE_ATTRIBUTE_SPARSE_FILE

This flag means that the file is sparse. Although
ZFS may modify this in some situations, there is
not generally any special handling for this flag.

UF_OFFLINE: Command line name: "offline" or "uoffline"
ZFS name: XAT_OFFLINE, ZFS_OFFLINE
Windows: FILE_ATTRIBUTE_OFFLINE

This flag means that the file has been moved to
offline storage. FreeBSD does not have any special
handling for this flag.

UF_REPARSE: Command line name: "reparse" or "ureparse"
ZFS name: XAT_REPARSE, ZFS_REPARSE
Windows: FILE_ATTRIBUTE_REPARSE_POINT

This flag means that the file is a Windows reparse
point. ZFS has special handling code for reparse
points, but we don't currently have the other
supporting infrastructure for them.

UF_HIDDEN: Command line name: "hidden" or "uhidden"
ZFS name: XAT_HIDDEN, ZFS_HIDDEN
Windows: FILE_ATTRIBUTE_HIDDEN

This flag means that the file may be excluded from
a directory listing if the application honors it.
FreeBSD has no special handling for this flag.

The name and bit definition for UF_HIDDEN are
identical to the definition in MacOS X.

UF_READONLY: Command line name: "urdonly", "rdonly", "readonly"
ZFS name: XAT_READONLY, ZFS_READONLY
Windows: FILE_ATTRIBUTE_READONLY

This flag means that the file may not be written or
appended to, but its attributes may be changed.

ZFS currently enforces this flag, but Illumos
developers have discussed disabling enforcement.

The behavior of this flag is different than MacOS X.
MacOS X uses UF_IMMUTABLE to represent the DOS
readonly permission, but that flag has a stronger
meaning than the semantics of DOS readonly permissions.

UF_ARCHIVE: Command line name: "uarch", "uarchive"
ZFS name: XAT_ARCHIVE, ZFS_ARCHIVE
Windows: FILE_ATTRIBUTE_ARCHIVE

The UF_ARCHIVE flag means that the file has changed and
needs to be archived. The meaning is the same as
the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attributes.

msdosfs and ZFS have special handling for this flag,
i.e. they will set it when the file changes.

sys/param.h: Bump __FreeBSD_version to 1000047 for the
addition of new stat(2) flags.

chflags.1: Document the new command line flag names
(e.g. "system", "hidden") available to the
user.

ls.1: Reference chflags(1) for a list of file flags
and their meanings.

strtofflags.c: Implement the mapping between the new
command line flag names and new stat(2)
flags.

chflags.2: Document all of the new stat(2) flags, and
explain the intended behavior in a little
more detail. Explain how they map to
Windows file attributes.

Different filesystems behave differently
with respect to flags, so warn the
application developer to take care when
using them.

zfs_vnops.c: Add support for getting and setting the
UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.

All of these flags are implemented using
attributes that ZFS already supports, so
the on-disk format has not changed.

ZFS currently doesn't allow setting the
UF_REPARSE flag, and we don't really have
the other infrastructure to support reparse
points.

msdosfs_denode.c,
msdosfs_vnops.c: Add support for getting and setting
UF_HIDDEN, UF_SYSTEM and UF_READONLY
in MSDOSFS.

It supported SF_ARCHIVED, but this has been
changed to be UF_ARCHIVE, which has the same
semantics as the DOS archive attribute instead
of inverse semantics like SF_ARCHIVED.

After discussion with Bruce Evans, change
several things in the msdosfs behavior:

Use UF_READONLY to indicate whether a file
is writeable instead of file permissions, but
don't actually enforce it.

Refuse to change attributes on the root
directory, because it is special in FAT
filesystems, but allow most other attribute
changes on directories.

Don't set the archive attribute on a directory
when its modification time is updated.
Windows and DOS don't set the archive attribute
in that scenario, so we are now bug-for-bug
compatible.

smbfs_node.c,
smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM,
UF_READONLY and UF_ARCHIVE in SMBFS.

This is similar to changes that Apple has
made in their version of SMBFS (as of
smb-583.8, posted on opensource.apple.com),
but not quite the same.

We map SMB_FA_READONLY to UF_READONLY,
because UF_READONLY is intended to match
the semantics of the DOS readonly flag.
The MacOS X code maps both UF_IMMUTABLE
and SF_IMMUTABLE to SMB_FA_READONLY, but
the immutable flags have stronger meaning
than the DOS readonly bit.

stat.h: Add definitions for UF_SYSTEM, UF_SPARSE,
UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
and UF_HIDDEN.

The definition of UF_HIDDEN is the same as
the MacOS X definition.

Add commented-out definitions of
UF_COMPRESSED and UF_TRACKED. They are
defined in MacOS X (as of 10.8.2), but we
do not implement them (yet).

ufs_vnops.c: Add support for getting and setting
UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
Alphabetize the flags that are supported.

These new flags are only stored; UFS does
not take any action if the flag is set.
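
As an illustration of the attribute mapping described for msdosfs and smbfs,
here is a small table-driven sketch. The DEMO_ATTR_ values are the usual FAT
directory-entry attribute bits; the DEMO_UF_ values are placeholders chosen
for this example (real code would use the UF_* definitions from sys/stat.h),
and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* FAT directory-entry attribute bits. */
#define DEMO_ATTR_READONLY 0x01u
#define DEMO_ATTR_HIDDEN   0x02u
#define DEMO_ATTR_SYSTEM   0x04u
#define DEMO_ATTR_ARCHIVE  0x20u

/* Placeholder flag values for this sketch only. */
#define DEMO_UF_SYSTEM   0x0001u
#define DEMO_UF_ARCHIVE  0x0002u
#define DEMO_UF_READONLY 0x0004u
#define DEMO_UF_HIDDEN   0x0008u

static const struct {
    uint32_t dos;
    uint32_t flag;
} demo_map[] = {
    { DEMO_ATTR_READONLY, DEMO_UF_READONLY },
    { DEMO_ATTR_HIDDEN,   DEMO_UF_HIDDEN },
    { DEMO_ATTR_SYSTEM,   DEMO_UF_SYSTEM },
    { DEMO_ATTR_ARCHIVE,  DEMO_UF_ARCHIVE },
};

/* Translate DOS attribute bits into file flags; other bits are ignored. */
static uint32_t demo_dos_to_flags(uint32_t attr)
{
    uint32_t flags = 0;
    size_t i;

    for (i = 0; i < sizeof(demo_map) / sizeof(demo_map[0]); i++)
        if (attr & demo_map[i].dos)
            flags |= demo_map[i].flag;
    return flags;
}

/* Translate file flags back into DOS attribute bits. */
static uint32_t demo_flags_to_dos(uint32_t flags)
{
    uint32_t attr = 0;
    size_t i;

    for (i = 0; i < sizeof(demo_map) / sizeof(demo_map[0]); i++)
        if (flags & demo_map[i].flag)
            attr |= demo_map[i].dos;
    return attr;
}
```

Because the archive flag has the same sense as the DOS archive bit, the
mapping is a plain one-to-one translation with no inversion.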

Sponsored by: Spectra Logic
Reviewed by: bde (earlier version)


253998 06-Aug-2013 mckusick

This bug fix is in a code path in rename taken when there is a
collision between a rename and an open system call for the same
target file. Here, rename releases its vnode references, waits for
the open to finish, and then restarts by reacquiring its needed
vnode locks. In this case, rename was unlocking but failing to
release its reference to one of its held vnodes. The effect was
that even after all the actual references to the vnode had gone,
the vnode still showed active references. For files that had been
removed, their space was not reclaimed until the filesystem was
forcibly unmounted.

This bug manifested itself in the Postgres server which would
leak/lose hundreds of files per day amounting to many gigabytes of
disk space. This bug required shutting down Postgres, forcibly
unmounting its filesystem, remounting its filesystem and restarting
Postgres every few days to recover the lost space.

Reported by: Dan Thomas and Palle Girgensohn
Bug-fix by: kib
Tested by: Dan Thomas and Palle Girgensohn
MFC after: 2 weeks


253974 05-Aug-2013 mckusick

With the addition of journalled soft updates, the "newblk" structures
persist much longer than previously. Historically we had at most 100
entries; now the count may reach a million. With the increased count
we spent far too much time looking them up in the grossly undersized
newblk hash table. Configure the newblk hash table to accurately reflect
the number of entries that it must index.
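
A back-of-the-envelope sketch of why the table size matters. The helper names
are hypothetical, and the 128-bucket figure used in the note after the code is
an assumed example size, not the historical value.

```c
#include <stddef.h>

/* Smallest power of two that is >= expected_entries (at least 1). */
static size_t demo_hash_buckets(size_t expected_entries)
{
    size_t buckets = 1;

    while (buckets < expected_entries)
        buckets <<= 1;
    return buckets;
}

/* Average hash-chain length, rounded up. */
static size_t demo_avg_chain(size_t entries, size_t buckets)
{
    return (entries + buckets - 1) / buckets;
}
```

A million entries spread over an assumed 128 buckets average about 7813
entries per chain, whereas a table sized for a million entries keeps the
average chain at one entry.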

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


253973 05-Aug-2013 mckusick

To better understand performance problems with journalled soft updates,
we need to collect the highest level of allocation for each of the
different soft update dependency structures. This change collects these
statistics and makes them available using `sysctl debug.softdep.highuse'.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


253341 14-Jul-2013 mckusick

Update to comments describing block allocation policy.

Submitted by: Bruce Evans


253280 12-Jul-2013 kib

Only copy as many bytes as there are in the superblock, instead of a full
block copy, when copying the superblock into the snapshot. UFS1 does
not align the superblock on a block boundary, and bcopy runs off the end
of the buffer.
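
The idea of the fix, reduced to a standalone sketch. demo_copy_superblock()
and its parameters are hypothetical; the kernel code works on struct fs and
buffer cache blocks rather than raw byte arrays.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the superblock into a block-sized destination, but never read more
 * than sb_size bytes from the source. Returns the number of bytes copied.
 */
static size_t demo_copy_superblock(void *dst, size_t block_size,
    const void *sb, size_t sb_size)
{
    size_t n = sb_size < block_size ? sb_size : block_size;

    memcpy(dst, sb, n);
    return n;
}
```

Copying block_size bytes unconditionally would read past the end of the
source whenever the superblock is smaller than a block.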

Reported by: Andre Albsmeier <Andre.Albsmeier@siemens.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


253163 10-Jul-2013 pfg

Change i_gen in UFS to an unsigned type.

Missing type change from r252435.

This fixes a "Stale NFS file handle" error.

Reported by: Claude Bisson
Tested by: Claude Bisson
Pointed hat: pfg


253106 09-Jul-2013 kib

There are several code sequences like
vfs_busy(mp);
vfs_write_suspend(mp);
which are problematic if another thread starts an unmount between the two
calls. The unmount starts a write, while vfs_write_suspend() drains
writers. On the other hand, unmount drains busy references, causing
the deadlock.

Add a flags argument to vfs_write_suspend() and require its callers
to specify the VS_SKIP_UNMOUNT flag when the call is not performed in the
mount path, i.e. the covered vnode is not locked. The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and an unmount is in
progress.
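
A toy model of the new check. struct demo_mount, DEMO_VS_SKIP_UNMOUNT and the
EBUSY return value are assumptions made for this sketch; only the rule itself
(do not attempt suspension when the flag is given and an unmount is running)
comes from the description above.

```c
#include <errno.h>

#define DEMO_VS_SKIP_UNMOUNT 0x1

struct demo_mount {
    int unmounting;    /* an unmount is in progress */
    int suspended;     /* writes are suspended */
};

/* Suspend writes unless the caller asked to back off during unmount. */
static int demo_write_suspend(struct demo_mount *mp, int flags)
{
    if ((flags & DEMO_VS_SKIP_UNMOUNT) != 0 && mp->unmounting)
        return EBUSY;
    mp->suspended = 1;
    return 0;
}
```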

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


252527 02-Jul-2013 mckusick

Make better use of the metadata area by avoiding using it for data blocks
that should no longer immediately follow their indirect blocks.

MFC after: 2 weeks


252515 02-Jul-2013 pfg

Style fix: spaces.

Cleanup the incomplete revert.

Reported by: bde
MFC after: 4 weeks


252484 01-Jul-2013 pfg

Change i_gen in UFS to an unsigned type.

Revert the simplification of the i_gen calculation.
It is still a good idea to avoid zero values and for the case
of old filesystems there is probably no advantage in using
the complete 32 bits anyways.

Discussed with: bde
MFC after: 4 weeks


252467 01-Jul-2013 pfg

Change i_gen in UFS to an unsigned type.

Further simplify the i_gen calculation for older disks.
Having a zero here is not really a problem and this is more
similar to what is done in newfs_random().

Reported by: Xin Li
MFC after: 4 weeks


252438 01-Jul-2013 gleb

Don't assume that UFS on-disk format of a directory is the same as
defined by <sys/dirent.h>

Always start parsing at DIRBLKSIZ aligned offset, skip first entries if
uio_offset is not DIRBLKSIZ aligned. Return EINVAL if buffer is too
small for single entry.

Preallocate buffer for cookies. Cookies will be replaced with d_off
field in struct dirent at later point.

Skip entries with zero inode number.

Stop mangling dirent in ufs_extattr_iterate_directory().
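
The parsing rules above (start at a DIRBLKSIZ-aligned offset, skip entries
before the requested offset, skip entries with a zero inode number) can be
sketched like this. struct demo_direct is a deliberately simplified stand-in
for the on-disk entry, and DEMO_DIRBLKSIZ and demo_count_entries() are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_DIRBLKSIZ 512

/* Simplified entry header; real entries also carry a type and a name. */
struct demo_direct {
    uint32_t d_ino;       /* 0 marks an unused slot */
    uint16_t d_reclen;    /* distance in bytes to the next entry */
};

/*
 * Count live entries located at or after `offset`. Parsing always begins
 * at the enclosing block boundary so that record lengths stay in step.
 * Returns -1 if a record length is implausible.
 */
static int demo_count_entries(const unsigned char *dir, size_t dirsize,
    size_t offset)
{
    size_t pos = offset - (offset % DEMO_DIRBLKSIZ);
    struct demo_direct de;
    int live = 0;

    while (pos + sizeof(de) <= dirsize) {
        memcpy(&de, dir + pos, sizeof(de));
        if (de.d_reclen < sizeof(de) || pos + de.d_reclen > dirsize)
            return -1;
        if (pos >= offset && de.d_ino != 0)
            live++;
        pos += de.d_reclen;
    }
    return live;
}
```

Starting in the middle of a block would risk interpreting name bytes as an
entry header, which is why the offset is rounded down first.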

Reviewed by: kib
Sponsored by: Google Summer Of Code 2011


252437 01-Jul-2013 pfg

Change i_gen in UFS to an unsigned type.

Missed format specifier.

Reported by: mdf
MFC after: 4 weeks


252435 01-Jul-2013 pfg

Change i_gen in UFS to an unsigned type.

In UFS, i_gen is a randomly generated value and there is no way for
it to be negative. Actually, the value of i_gen is just used to
match bit patterns and it is of no consequence if the values are
signed or not.

Following other filesystems, set it to unsigned and use it as such.

Discussed by: mckusick
Reviewed by: mckusick (previous version)
MFC after: 4 weeks


251171 31-May-2013 jeff

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


250901 22-May-2013 mckusick

Properly spell sentinel (missed in 250891)
No functional changes.

Spotted by: Navdeep Parhar and Alexey Dokuchaev
MFC after: 2 weeks


250897 22-May-2013 mckusick

Add missing buffer releases (brelse) after bread calls that return
an error. One could argue that returning a buffer even when it is
not valid is incorrect, but bread has always returned a buffer
valid or not.
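
The pattern being fixed looks roughly like the sketch below. demo_bread() and
demo_brelse() are toy stand-ins that only count outstanding buffers; they
mimic the convention that a buffer is handed back even on error, and are not
the kernel functions.

```c
#include <errno.h>
#include <stddef.h>

struct demo_buf {
    int held;
};

static int demo_outstanding;    /* buffers acquired but not yet released */

/* Always hands back a buffer, valid or not, like the real convention. */
static int demo_bread(struct demo_buf *slot, int fail, struct demo_buf **bpp)
{
    slot->held = 1;
    demo_outstanding++;
    *bpp = slot;
    return fail ? EIO : 0;
}

static void demo_brelse(struct demo_buf *bp)
{
    bp->held = 0;
    demo_outstanding--;
}

static int demo_read_block(struct demo_buf *slot, int fail)
{
    struct demo_buf *bp;
    int error;

    error = demo_bread(slot, fail, &bp);
    if (error != 0) {
        demo_brelse(bp);    /* the release that used to be missing */
        return error;
    }
    /* ... use the buffer contents here ... */
    demo_brelse(bp);
    return 0;
}
```

Without the release on the error path, every failed read would leave one
buffer permanently held.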

Reviewed by: kib
MFC after: 2 weeks


250895 22-May-2013 mckusick

Add missing 28th element to softdep types name array.

Found by: Coverity Scan, CID 1007621
Reviewed by: kib
MFC after: 2 weeks


250894 22-May-2013 mckusick

Null a pointer after it is freed so that when it is returned
the return value is NULL. Based on the returned flags, the
return value should never be inspected in the case where NULL
is returned, but it is good coding practice not to return a
pointer to freed memory.

Found by: Coverity Scan, CID 1006096
Reviewed by: kib
MFC after: 2 weeks


250892 22-May-2013 mckusick

Remove a bogus check for a NULL buffer pointer.
Add a KASSERT that it is not NULL.

Found by: Coverity Scan, CID 1009114
Reviewed by: kib
MFC after: 2 weeks


250891 22-May-2013 mckusick

Properly spell sentinel (not sintenel or sentinal).
No functional changes.

Spotted by: kib
MFC after: 2 weeks


250576 12-May-2013 eadler

Fix several typos

PR: kern/176054
Submitted by: Christoph Mallon <christoph.mallon@gmx.de>
MFC after: 3 days


249582 17-Apr-2013 gabor

- Correct misspellings of the word occurrence

Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)


249218 06-Apr-2013 jeff

Prepare to replace the buf splay with a trie:

- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division


249064 03-Apr-2013 mckusick

The code in clear_remove() and clear_inodedeps() skips one entry
in the pagedep and inodedep hash tables. An entry in the table is
skipped because 'pagedep_hash' and 'inodedep_hash' hold the size
of the hash tables - 1.

It is extremely unlikely that this would cause any operational
failure. These functions only need to find a single entry and are
only called when there are too many entries. The chance that they
would fail because all the entries are on the single skipped hash
chain is remote.
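
The off-by-one can be reproduced in a few lines. This is a sketch with
hypothetical names; `mask` plays the role of 'pagedep_hash', i.e. the table
size minus one.

```c
#include <stddef.h>

/*
 * Return the index of the first non-empty bucket, or -1 if none is found.
 * With fixed == 0 the mask is mistaken for the table size, so the last
 * bucket is never examined.
 */
static int demo_find_nonempty(const int *counts, size_t mask, int fixed)
{
    size_t nbuckets = fixed ? mask + 1 : mask;
    size_t i;

    for (i = 0; i < nbuckets; i++)
        if (counts[i] > 0)
            return (int)i;
    return -1;
}
```

With four buckets (mask 3) and all entries on the last chain, the unfixed
variant finds nothing while the fixed one returns bucket 3.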

Submitted by: Pedro Martelletto
Reviewed by: kib
MFC after: 2 weeks


248623 22-Mar-2013 mckusick

The purpose of this change to the FFS layout policy is to reduce the
running time for a full fsck. It also reduces the random access time
for large files and speeds the traversal time for directory tree walks.

The key idea is to reserve a small area in each cylinder group
immediately following the inode blocks for the use of metadata,
specifically indirect blocks and directory contents. The new policy
is to preferentially place metadata in the metadata area and
everything else in the blocks that follow the metadata area.

The size of this area can be set when creating a filesystem using
newfs(8) or changed in an existing filesystem using tunefs(8).
Both utilities use the `-k held-for-metadata-blocks' option to
specify the amount of space to be held for metadata blocks in each
cylinder group. By default, newfs(8) sets this area to half of
minfree (typically 4% of the data area).

This work was inspired by a paper presented at Usenix's FAST '13:
www.usenix.org/conference/fast13/ffsck-fast-file-system-checker

Details of this implementation appear in the April 2013 issue of ;login:
www.usenix.org/publications/login/april-2013-volume-38-number-2.
A copy of the April 2013 ;login: paper can also be downloaded
from: www.mckusick.com/publications/faster_fsck.pdf.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks


248561 20-Mar-2013 mckusick

When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.
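
A simplified model of the walk. Directories are small integers, the two
parent arrays stand in for the name cache and the on-disk ".." entries, and
the EINVAL result for a cycle is an assumption of this sketch; none of the
names below are kernel interfaces.

```c
#include <errno.h>

#define DEMO_ROOT 0

/*
 * Walk from `target` up to the root. Returns EINVAL if `source` is met on
 * the way (the rename would create a loop), 0 otherwise. cache_parent[d]
 * is -1 on a cache miss, in which case the parent is read "from disk" and
 * *io_count is incremented.
 */
static int demo_checkpath(int source, int target, const int *cache_parent,
    const int *disk_parent, int *io_count)
{
    int dir = target;
    int parent;

    for (;;) {
        if (dir == source)
            return EINVAL;
        if (dir == DEMO_ROOT)
            return 0;
        parent = cache_parent[dir];
        if (parent < 0) {
            parent = disk_parent[dir];    /* fall back to I/O */
            (*io_count)++;
        }
        dir = parent;
    }
}
```

Only the cache misses cost an I/O, so a warm name cache lets the whole walk
complete without reading any directory blocks.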

Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks


248521 19-Mar-2013 kib

UFS support of the unmapped i/o for the user data buffers.

Sponsored by: The FreeBSD Foundation
Tested by: pho, scottl, jhb, bf


248515 19-Mar-2013 kib

Do not remap usermode pages into KVA for physio.

Sponsored by: The FreeBSD Foundation
Tested by: pho


248422 17-Mar-2013 kib

Remove negative name cache entry pointing to the target name, which
could be instantiated while tdvp was unlocked.

Reported by: Rick Miller <vmiller at hostileadmin com>
Tested by: pho
MFC after: 1 week


248283 14-Mar-2013 kib

Some style fixes.

Sponsored by: The FreeBSD Foundation


248282 14-Mar-2013 kib

Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions. The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by: The FreeBSD Foundation
Tested by: pho


248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes follow:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* To avoid namespace pollution, vm/vm_pager.h does not pull in the locking
headers itself (consumers had to include sys/mutex.h directly to cater for
its inline functions using VM_OBJECT_LOCK()). As a result, all
vm/vm_pager.h consumers must now also include sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks in
the compat layer, because a name clash between the FreeBSD and Solaris
versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


247388 27-Feb-2013 kib

The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with a non-zero refcount is
encountered. The situation is possible because quotas are turned off
before the softdep workitem queue is flushed, since the quota file writes
might create softdep workitems.

Make encountering an active dquot in dqflush() non-fatal, and return
the error from quotaoff() instead. Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of the softdep_flushfiles()
loop, until the last iteration. On the last iteration, the quotas must be
closed, and because SU workitems should already be flushed, the
references to the dquot are gone.

Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Reviewed by: mckusick
MFC after: 2 weeks


247387 27-Feb-2013 kib

An inode block must not be blockingly read while cg block is owned.
The order is inode buffer lock -> snaplk -> cg buffer lock, reversing
the order causes deadlocks.

Inode block must not be written while cg block buffer is owned. The
FFS copy on write needs to allocate a block to copy the content of the
inode block, and the cylinder group selected for the allocation might
be the same as the owned cg block. The reserved block detection code
in the ffs_copyonwrite() and ffs_bp_snapblk() is unable to detect the
situation, because the locked cg buffer is not exposed to it.

In order to maintain the dependency between the initialized inode block
and the cg_initediblk pointer, look up the inode buffer in
non-blocking mode. If that succeeds, brelse the cg block, initialize the
inode block and write it. After the write is finished, reread the cg block
and update cg_initediblk.

If the inode block is already locked by another thread, let the other
thread initialize it. If another thread raced with us after we
started writing the inode block, the situation is detected by an update of
cg_initediblk. Note that double-initialization of the inode block is
harmless; the block cannot be used until cg_initediblk is incremented.

Sponsored by: The FreeBSD Foundation
In collaboration with: pho
Reviewed by: mckusick
MFC after: 1 month
X-MFC-note: after r246877


246877 16-Feb-2013 mckusick

The UFS2 filesystem allocates new blocks of inodes as they are needed.
When a cylinder group runs short of inodes, a new block for inodes is
allocated, zero'ed, and written to the disk. The zero'ed inodes must
be on the disk before the cylinder group can be updated to claim them.
If the cylinder group claiming the new inodes were written before the
zero'ed block of inodes, the system could crash with the filesystem in
an unrecoverable state.

Rather than adding a soft updates dependency to ensure that the new
inode block is written before it is claimed by the cylinder group
map, we just do a barrier write of the zero'ed inode block to ensure
that it will get written before the updated cylinder group map can
be written. This change should only slow down bulk loading of newly
created filesystems since that is the primary time that new inode
blocks need to be created.

Reported by: Robert Watson
Reviewed by: kib
Tested by: Peter Holm


246612 10-Feb-2013 kib

Fix several unsafe pointer dereferences in the buffered_write()
function, implementing the sysctl vfs.ffs.set_bufoutput (not used in
the tree yet).

- The current directory vnode dereference is unsafe since fd_cdir
could be changed and unreferenced, lock the filedesc around and vref
the fd_cdir.
- The VTOI() conversion of the fd_cdir is unsafe without first
checking that the vnode is indeed from an FFS mount, otherwise
the code dereferences random memory.
- The cdir could be reclaimed from under us, lock it around the
checks.
- The type of the fp vnode might not be a disk, or it might have
changed while the thread was in flight, check the type.

Reviewed and tested by: mckusick
MFC after: 2 weeks


246562 08-Feb-2013 pfg

Remove unused MAXSYMLINKLEN macro.

Reviewed by: mckusick
PR: kern/175794
MFC after: 1 week


246299 03-Feb-2013 pfg

UFS: Remove dead assignment.

Submitted by: Christoph Mallon
MFC after: 3 days


246289 03-Feb-2013 mckusick

For UFS2 i_blocks is unsigned. The current "sanity" check that it
has gone below zero after the blocks in its inode are freed is a
no-op which the compiler fails to warn about because of the use of
the DIP macro. Change the sanity check to compare the number of
blocks being freed against the value i_blocks. If the number of
blocks being freed exceeds i_blocks, just set i_blocks to zero.
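
The replacement check amounts to a saturating subtraction, as in this sketch
(demo_release_blocks() is a hypothetical helper, not the code in the tree).
An unsigned counter can never compare below zero, which is why the old test
never fired.

```c
#include <stdint.h>

/* Subtract freed blocks from an unsigned block count, clamping at zero. */
static uint64_t demo_release_blocks(uint64_t i_blocks, uint64_t freed)
{
    if (freed > i_blocks)
        return 0;
    return i_blocks - freed;
}
```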

Reported by: Pedro Giffuni (pfg@)
MFC after: 2 weeks


245286 11-Jan-2013 kib

Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by: The FreeBSD Foundation


244925 01-Jan-2013 kib

The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that it must not be called while the snaplock is
owned. vfs_write_resume(9) calls the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by the snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


244795 28-Dec-2012 kib

Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount. The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock. If a suspension intervened between the resume and
vn_start_write(), a deadlock occurred after the suspending thread
tried to lock the snaplock, most typically during the write in
ffs_copyonwrite().

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: mckusick
MFC after: 2 weeks
X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC,
in HEAD


244534 21-Dec-2012 attilio

Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There is no evidence that consumers could cope with such a change in
semantics, and this situation could eventually lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week


244239 15-Dec-2012 kib

Fix a typo that resulted in a NULL pointer dereference.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


243311 19-Nov-2012 attilio

The comment referring to r16312 has not been accurate for many years
(likely since VFS received granular locking), but the comment present in
UFS has been copied incorrectly into other filesystems' code several times.

Remove comments that make no sense now.

Reviewed by: kib
MFC after: 3 days


243250 18-Nov-2012 trasz

Fix build of kdump(1).


243245 18-Nov-2012 trasz

Add UFS writesuspension mechanism, designed to allow userland processes
to modify on-disk metadata for filesystems mounted for write.

Reviewed by: kib, mckusick
Sponsored by: FreeBSD Foundation


243018 14-Nov-2012 jeff

- Fix a truncation bug with softdep journaling that could leak blocks on
crash. When truncating a file that never made it to disk we use the
canceled allocation dependencies to hold the journal records until
the truncation completes. Previously allocdirect dependencies on
the id_bufwait list were not considered and their journal space
could expire before the bitmaps were written. Cancel them and attach
them to the freeblks as we do for other allocdirects.
- Add KTR traces that were used to debug this problem.
- When adding jsegdeps, always use jwork_insert() so we don't have more
than one segdep on a given jwork list.

Sponsored by: EMC / Isilon Storage Division


242924 12-Nov-2012 jeff

- Fix a bug that has existed since the original softdep implementation.
When a background copy of a cg is written we complete any work associated
with that bmsafemap. If new work has been added to the non-background
copy of the buffer it will be completed before the next write happens.
The solution is to do the rollbacks when we make the copy so only those
dependencies that were present at the time of writing will be completed
when the background write completes. This would've resulted in various
bitmap related corruptions and panics. It also would've expired journal
entries early causing journal replay to miss some records.

MFC after: 2 weeks


242833 09-Nov-2012 attilio

Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.


242815 09-Nov-2012 jeff

- Correct rev 242734, segments can sometimes get stuck. Be a bit more
defensive with segment state.

Reported by: b. f. <bf1783@googlemail.com>


242734 08-Nov-2012 jeff

- Implement BIO_FLUSH support around journal entries. This will not 100%
solve power loss problems with dishonest write caches. However, it
should improve the situation and force a full fsck when it is unable
to resolve with the journal.
- Resolve a case where the journal could wrap in an unsafe way causing
us to prematurely lose journal entries in very specific scenarios.

Discussed with: mckusick
MFC after: 1 month


242520 03-Nov-2012 mckusick

When a file is first being written, the dynamic block reallocation
(implemented by ffs_reallocblks_ufs[12]) relocates the file's blocks
so as to cluster them together into a contiguous set of blocks on
the disk.

When the cluster crosses the boundary into the first indirect block,
the first indirect block is initially allocated in a position
immediately following the last direct block. Block reallocation
would usually destroy locality by moving the indirect block out of
the way to keep the data blocks contiguous. This change compensates
for this problem by noting that the first indirect block should be
left immediately following the last direct block. It then tries
to start a new cluster of contiguous blocks (referenced by the
indirect block) immediately following the indirect block.

We should also do this for other indirect block boundaries, but it
is only important for the first one.

Suggested by: Bruce Evans
MFC: 2 weeks


242492 02-Nov-2012 jeff

- In cancel_mkdir_dotdot don't panic if the inodedep is not available. If
the previous diradd had already finished it could have been reclaimed
already. This would only happen under heavy dependency pressure.

Reported by: Andrey Zonov <zont@FreeBSD.org>
Discussed with: mckusick
MFC after: 1 week


242476 02-Nov-2012 kib

The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem. There
is a symmetric situation where the binary could already have file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Calling the VOPs instead of directly
accessing v_writecount provides the fix described in the previous
paragraph.

Tested by: pho
MFC after: 3 weeks


242379 30-Oct-2012 trasz

Fix problem with geom_label(4) not recognizing UFS labels on filesystems
extended using growfs(8). The problem here is that geom_label checks if
the filesystem size recorded in UFS superblock is equal to the provider
(i.e. device) size. This check cannot be removed due to backward
compatibility. On the other hand, in most cases growfs(8) cannot set
fs_size in the superblock to match the provider size, because, differently
from newfs(8), it cannot recompute cylinder group sizes.

To fix this problem, add another superblock field, fs_providersize, used
only for this purpose. The geom_label(4) will attach if either fs_size
(filesystem created with newfs(8)) or fs_providersize (filesystem expanded
using growfs(8)) matches the device size.

PR: kern/165962
Reviewed by: mckusick
Sponsored by: FreeBSD Foundation
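
The attach rule described in r242379 can be sketched as follows. This is a minimal illustration, not the geom_label(4) code: the struct `ex_sb_sizes` and the function `ex_label_size_matches` are hypothetical names, and all sizes are assumed to be expressed in the same units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two superblock fields involved. */
struct ex_sb_sizes {
	int64_t fs_size;         /* size recorded when the filesystem was created */
	int64_t fs_providersize; /* size recorded when it was grown, 0 if unset */
};

/*
 * Return true when the label may attach: either recorded size equals
 * the provider (device) size.
 */
bool
ex_label_size_matches(const struct ex_sb_sizes *sb, int64_t provider_size)
{
	assert(sb != NULL);
	return (sb->fs_size == provider_size ||
	    sb->fs_providersize == provider_size);
}
```

A filesystem created at 100 units matches a 100-unit provider through fs_size; one grown from 90 to fill a 100-unit provider matches through fs_providersize instead.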


242259 28-Oct-2012 trasz

Fix two problems that caused instant panic when the device mounted
with softupdates went away. Note that this does not fix the problem
entirely; I'm committing it now to make it easier for someone to pick
up the work.

Reviewed by: mckusick


241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


241011 27-Sep-2012 mdf

Fix up kernel sources to be ready for a 64-bit ino_t.

Original code by: Gleb Kurtsou


239359 17-Aug-2012 mjg

Remove unused member of struct indir (in_exists) from UFS and EXT2 code.

Reviewed by: mckusick
Approved by: trasz (mentor)
MFC after: 1 week


239065 05-Aug-2012 kib

After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitly for the kernel code which needs it.

Suggested and reviewed by: alc
MFC after: 2 weeks


238697 22-Jul-2012 kevlo

Use NULL instead of 0 for pointers


238029 02-Jul-2012 kib

Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32-bit architectures f_offset, which is a 64-bit quantity,
is always read and written under the mtxpool protection. This fixes an
apparently easy-to-trigger race in which parallel lseek()s, or lseek() and
read/write, could corrupt the file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by: pho
No objections from: jhb
MFC after: 3 weeks


237366 21-Jun-2012 kib

Fix unbounded-length malloc, controlled from usermode. The added check
is performed before exact size of the buffer is calculated, but the
buffer cannot have size greater than the total space allocated for
extended attributes. The existing check is executing with precise
size, but it is too late, since buffer needs to be allocated in
advance.

Also, adapt to uio_resid being of ssize_t type. Use lblktosize instead of
multiplying by fs block size by hand as well.

Reported and tested by: pho
MFC after: 1 week
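
The early bound check described in r237366 follows a common pattern: reject a caller-controlled length against a known upper limit before allocating. A minimal userland sketch, assuming a hypothetical helper `ex_alloc_bounded` and an assumed EINVAL return for rejected requests (the commit message does not say which error the kernel returns):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate a buffer of a caller-supplied length, refusing any length
 * larger than the total space available (here: ea_total_space).
 * On success *out points to the buffer and 0 is returned.
 */
int
ex_alloc_bounded(size_t requested, size_t ea_total_space, void **out)
{
	assert(out != NULL);
	*out = NULL;
	if (requested > ea_total_space)
		return (EINVAL);	/* assumed error code; reject before allocating */
	*out = malloc(requested != 0 ? requested : 1);
	return (*out != NULL ? 0 : ENOMEM);
}
```

The check is deliberately coarse: it runs before the precise size is known, so it only has to guarantee that the allocation cannot exceed the total attribute space.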


236937 11-Jun-2012 mckusick

In softdep_setup_inomapdep() we may have to allocate both inodedep
and bmsafemap dependency structures in inodedep_lookup() and
bmsafemap_lookup() respectively. The setup of these structures must
be done while holding the soft-dependency mutex. If the inodedep is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the bmsafemap. If the bmsafemap is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the inodedep.

To resolve this problem, bmsafemap_lookup has had a parameter added
that allows a pre-malloc'ed bmsafemap to be passed in so that it does
not need to release the mutex to create a new bmsafemap. The
softdep_setup_inomapdep() routine pre-malloc's a bmsafemap dependency
before acquiring the mutex and starting to build the inodedep with a
call to inodedep_lookup(). The subsequent call to bmsafemap_lookup()
is passed this pre-allocated bmsafemap entry so that it need not
release the mutex if it needs to create a new one.

Reported by: Peter Holm
Tested by: Peter Holm
MFC after: 1 week
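
The pre-allocation interface described in r236937 can be sketched in userland as a lookup that never allocates. This is an illustration under stated assumptions, not the softdep code: `struct ex_entry` and `ex_lookup_or_insert` are hypothetical, and the mutex the caller would hold across the call is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct ex_entry {
	int key;
	struct ex_entry *next;
};

/*
 * Look up key in the list. If it is absent, link in the caller's
 * pre-allocated entry instead of allocating here, so a caller holding
 * a lock never has to drop it to allocate.
 * *used is set to 1 when prealloc was consumed, 0 otherwise.
 */
struct ex_entry *
ex_lookup_or_insert(struct ex_entry **head, int key,
    struct ex_entry *prealloc, int *used)
{
	struct ex_entry *e;

	assert(head != NULL && prealloc != NULL && used != NULL);
	*used = 0;
	for (e = *head; e != NULL; e = e->next)
		if (e->key == key)
			return (e);	/* existing entry; prealloc is untouched */
	prealloc->key = key;
	prealloc->next = *head;
	*head = prealloc;
	*used = 1;
	return (prealloc);
}
```

When `*used` comes back 0, the caller frees the spare entry after releasing the lock.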


236322 30-May-2012 kib

Enable vn_io_fault() lock avoidance for UFS.

Tested by: pho
MFC after: 2 months


236044 26-May-2012 kib

Implement SEEK_HOLE/SEEK_DATA for UFS.

MFC after: 2 weeks


235610 18-May-2012 mckusick

Add missing `continue' statement at end of case.

Found by: Kevin Lo (kevlo@)
MFC after: 1 week


234613 23-Apr-2012 trasz

Remove unused thread argument from ufs_extattr_uepm_lock()/ufs_extattr_uepm_unlock().


234612 23-Apr-2012 trasz

Fix build.


234608 23-Apr-2012 trasz

Remove unused thread argument from clear_inodeps() and clear_remove().


234607 23-Apr-2012 trasz

Remove unused thread argument to vrecycle().

Reviewed by: kib


234605 23-Apr-2012 trasz

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


234537 21-Apr-2012 trasz

Fix use-after-free introduced in r234036.

Reviewed by: mckusick
Tested by: pho


234483 20-Apr-2012 mckusick

This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_sync_lazy and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_sync_lazy routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


234421 18-Apr-2012 jh

The part about exec atime no longer applies in the comment.

Pointed out by: bde


234386 17-Apr-2012 mckusick

Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if it is DOOMED or in other possible end-of-life scenarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


234158 11-Apr-2012 mckusick

Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after: 2 weeks


234103 10-Apr-2012 jh

- Return EPERM from ufs_setattr() when a user without PRIV_VFS_SYSFLAGS
privilege attempts to toggle SF_SETTABLE flags.
- Use the '^' operator in the SF_SNAPSHOT anti-toggling check.

Flags are now stored to ip->i_flags in one place after all checks.

Submitted by: bde
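
The '^' idiom mentioned in r234103 detects a change to one flag bit in either direction with a single test. A minimal sketch, assuming an illustrative bit value `EX_SF_SNAPSHOT`, a hypothetical function name, and an assumed EPERM result for a refused toggle:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define EX_SF_SNAPSHOT 0x00200000u	/* illustrative flag bit */

/*
 * Return EPERM if the requested flags would set or clear the snapshot
 * bit relative to the current flags, 0 otherwise. XOR leaves a 1 in
 * every bit position where the two values differ.
 */
int
ex_check_snapshot_toggle(uint32_t old_flags, uint32_t new_flags)
{
	if (((old_flags ^ new_flags) & EX_SF_SNAPSHOT) != 0)
		return (EPERM);
	return (0);
}
```

Changes to other bits pass through untouched, since the mask isolates the one bit being protected.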


234036 08-Apr-2012 trasz

Fix panic in ffs_reload(), which may happen when read-only filesystem
gets resized and then reloaded.

Reviewed by: kib, mckusick (earlier version)
Sponsored by: The FreeBSD Foundation


234024 08-Apr-2012 mckusick

Drop an unnecessary setting of si_mountpt when updating a UFS mount point.
Clearly it must have been set when the mount was done.

Reviewed by: kib


233875 04-Apr-2012 jh

Add a check for unsupported file flags to ufs_setattr().

Discussed with: bde
MFC after: 2 weeks


233817 02-Apr-2012 mckusick

A file cannot be deallocated until its last name has been removed
and it is no longer referenced by a user process. The inode for a
file whose name has been removed but is still referenced at the
time of a crash will still be allocated in the filesystem, but will
have no references (i.e., it will have no names referencing it
from any directory).

With traditional soft updates these unreferenced inodes will be
found and reclaimed when the background fsck is run. When using
journaled soft updates, the kernel must keep track of these inodes
so that it can find and reclaim them during the cleanup process.
Their existence cannot be stored in the journal as the journal only
handles short-term events, and they may persist for days. So, they
are tracked by keeping them in a linked list whose head pointer is
stored in the superblock. The journal tracks them only until their
linked list pointers have been committed to disk. Part of the cleanup
process involves traversing the list of unreferenced inodes and
reclaiming them.

This bug was triggered when confusion arose in the commit steps
of keeping the unreferenced-inode linked list coherent on disk.
Notably, a race between the link() system call adding a link-count
to a file and the unlink() system call removing a link-count to
the file. Here if the unlink() ran after link() had looked up
the file but before link() had incremented the link-count of the
file, the file's link-count would drop to zero before the link()
incremented it back up to one. If the file was referenced by a
user process, the first transition through zero made it appear
that it should be added to the unreferenced-inode list when in
fact it should not have been added. If the new name created by
link() was deleted within a few seconds (with the file still
referenced by a user process) it would legitimately be a candidate
for addition to the unreferenced-inode list. The result was that
there were two attempts to add the same inode to the unreferenced-inode
list which scrambled the unreferenced-inode list's pointers leading
to a panic. The fix is to detect and avoid the false attempt at
adding it to the unreferenced-inode list by having the link()
system call check to see if the link count is zero before it
increments it. If it is, the link() fails with ENOENT (showing that
it has failed the link()/unlink() race).

While tracking down this bug, we have added additional assertions
to detect the problem sooner and also simplified some of the code.

Reported by: Kirk Russell
Fix submitted by: Jeff Roberson
Tested by: Peter Holm
PR: kern/159971
MFC (to 9 only): 2 weeks
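
The fix described in r233817 amounts to a guard on the link count before it is incremented. A minimal sketch, assuming a hypothetical `struct ex_inode` and function `ex_try_add_link`; the locking that makes the check and the increment atomic in the kernel is omitted:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of an inode that matters here. */
struct ex_inode {
	int nlink;	/* number of directory names referencing the inode */
};

/*
 * Add a link unless the count has already dropped to zero, in which
 * case the caller lost the race with unlink() and gets ENOENT.
 */
int
ex_try_add_link(struct ex_inode *ip)
{
	assert(ip != NULL);
	if (ip->nlink == 0)
		return (ENOENT);
	ip->nlink++;
	return (0);
}
```

With this guard the count never passes through zero and back up, so the inode is considered for the unreferenced-inode list at most once.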


233787 02-Apr-2012 jh

- Use more natural ip->i_flags instead of vap->va_flags in the final
flags check.
- Add a comment for the immutable/append check done after handling of
the flags.
- Style improvements.

No functional change intended.

Submitted by: bde
MFC after: 2 weeks


233629 28-Mar-2012 mckusick

A refinement of change 232351 to avoid a race with a forcible unmount.
While we have a snapshot vnode unlocked to avoid a deadlock with another
inode in the same inode block being updated, the filesystem containing
it may be forcibly unmounted. When that happens the snapshot vnode is
revoked. We need to check for that condition and fail appropriately.

This change will be included along with 232351 when it is MFC'ed to 9.

Spotted by: kib
Reviewed by: kib


233627 28-Mar-2012 mckusick

Keep track of the mount point associated with a special device
to enable the collection of counts of synchronous and asynchronous
reads and writes for its associated filesystem. The counts are
displayed using `mount -v'.

Ensure that buffers used for paging indicate the vnode from
which they are operating so that counts of paging I/O operations
from the filesystem are collected.

This checkin only adds the setting of the mount point for the
UFS/FFS filesystem, but it would be trivial to add the setting
and clearing of the mount point at filesystem mount/unmount
time for other filesystems too.

Reviewed by: kib


233610 28-Mar-2012 kib

Do trivial reformatting of the comment to record the missed commit
message for r233609:
Restore the writes of atimes, quotas and superblock from syncer vnode.

Noted by: rdivacky


233609 28-Mar-2012 kib

Reviewed by: bde, mckusick
Tested by: pho
MFC after: 2 weeks


233608 28-Mar-2012 kib

Microoptimize: in qsync loop over mount vnodes, only unlock mount
interlock after we committed to try to vget() the vnode.

Submitted by: bde
Reviewed by: mckusick
Tested by: pho
MFC after: 1 week


233607 28-Mar-2012 kib

Update comment.

MFC after: 3 days


233438 25-Mar-2012 mckusick

Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that are passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib


232948 13-Mar-2012 kib

Supply a boolean as the second argument to ffs_update(), and not the
MNT_[NO]WAIT constants, which in fact always caused a sync operation.

Based on the submission by: bde
Reviewed by: mckusick
MFC after: 2 weeks
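
The problem fixed in r232948 is easy to reproduce in miniature: when a callee treats its argument as a boolean, any nonzero constant means "wait". The sketch below assumes illustrative values for the two constants (both nonzero) and uses hypothetical function names; it is not the ffs_update() code.

```c
#include <assert.h>

#define EX_MNT_WAIT	1	/* assumed value, nonzero */
#define EX_MNT_NOWAIT	2	/* assumed value, also nonzero */

/* Stand-in for a routine whose argument is a boolean "wait" flag. */
int
ex_update_is_synchronous(int waitfor)
{
	return (waitfor != 0);
}

/* Convert a mount-style wait constant into the boolean the callee expects. */
int
ex_wait_flag_to_bool(int mnt_flag)
{
	return (mnt_flag == EX_MNT_WAIT);
}
```

Passing `EX_MNT_NOWAIT` directly yields a synchronous update, which is the bug; passing `ex_wait_flag_to_bool(EX_MNT_NOWAIT)` yields the intended asynchronous one.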


232837 11-Mar-2012 kib

Remove superfluous brackets.

Submitted by: alc
MFC after: 2 weeks


232836 11-Mar-2012 kib

Do schedule delayed writes for async mounts.
While there, make some style adjustments, like missed () around
return values.

Submitted by: bde
Reviewed by: mckusick
Tested by: pho
MFC after: 2 weeks


232835 11-Mar-2012 kib

Do not fall back to slow synchronous i/o when low on memory or buffers.
The bawrite() schedules the write to happen immediately, and its use
frees the current thread to do more cleanups.

Submitted by: bde
Reviewed by: mckusick
Tested by: pho
MFC after: 2 weeks


232834 11-Mar-2012 kib

In ffs_syncvnode(), pass boolean false as second argument of ffs_update().
Synchronous inode block update is not needed for MNT_LAZY callers (syncer),
and since waitfor values are not zero, the code did an unnecessary synchronous
update.

Submitted by: bde
Reviewed by: mckusick
Tested by: pho
MFC after: 2 weeks


232833 11-Mar-2012 kib

Remove unneeded ARGSUSED lint command.

Submitted by: bde
MFC after: 3 days


232821 11-Mar-2012 kib

Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by: gianni


232732 09-Mar-2012 pho

Revert r232692 as the correct place to fix this is at the syscall level.


232709 09-Mar-2012 kib

Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after: 1 week


232701 08-Mar-2012 jhb

Add KTR_VFS traces to track modifications to a vnode's writecount.


232692 08-Mar-2012 pho

syscall() fuzzing can trigger this panic. Return EINVAL instead.

MFC after: 1 week


232401 02-Mar-2012 jhb

Similar to the fixes in 226967 and 226987, purge any name cache entries
associated with the previous vnode (if any) associated with the target of
a rename(). Otherwise, a lookup of the target pathname concurrent with a
rename() could re-add a name cache entry after the namei(RENAME) lookup
in kern_renameat() had purged the target pathname.

MFC after: 2 weeks


232351 01-Mar-2012 mckusick

This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks


232003 22-Feb-2012 kib

Properly lock DQREF() with dqhlock. Missed locking caused counter
corruption.

Assert that the dq reference value is sane before decrementing it.

Reported and tested by: pho
MFC after: 1 week


231949 21-Feb-2012 kib

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows SSIZE_MAX-sized i/o requests to be issued from
usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


231572 13-Feb-2012 mckusick

Missing conditions in checking whether an inode has been written.

Found and tested by: Peter Holm
MFC after: 2 weeks (to 9 only)


231313 09-Feb-2012 mckusick

Historically when an application wrote an entire block of a file,
the kernel allocated a buffer but did not zero it as it was about
to be completely filled by a uiomove() from the user's buffer.
However, if the uiomove() failed, the old contents of the buffer
could be exposed especially if the file was being mmap'ed. The
fix was to always zero the buffer when it was allocated.

This change first attempts the uiomove() to the newly allocated
(and dirty) buffer and only zeros it if the uiomove() fails. The
effect is to eliminate the gratuitous zeroing of the buffer in
the usual case where the uiomove() successfully fills it.

Reviewed by: kib
Tested by: scottl
MFC after: 2 weeks (to 9 only)
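
The ordering change in r231313 (copy first, zero only on failure) can be sketched in userland. Everything here is an assumption made for illustration: `ex_copyin` is a hypothetical stand-in for uiomove() that can be told to fail partway through, and `ex_fill_block` is not the kernel write path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical copy routine: copies n bytes, or only the first
 * fail_after bytes and returns nonzero when fail_after < n, to
 * simulate a fault in the middle of the copy.
 */
int
ex_copyin(void *dst, const void *src, size_t n, size_t fail_after)
{
	size_t done = n < fail_after ? n : fail_after;

	memcpy(dst, src, done);
	return (done == n ? 0 : -1);
}

/* Fill a freshly allocated block; zero it only if the copy fails. */
int
ex_fill_block(void *blk, size_t blksz, const void *src, size_t fail_after)
{
	int error;

	assert(blk != NULL && src != NULL);
	error = ex_copyin(blk, src, blksz, fail_after);
	if (error != 0)
		memset(blk, 0, blksz);	/* do not expose stale or partial data */
	return (error);
}
```

In the common case the block is written exactly once by the copy, and the memset() cost is paid only on the error path.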


231160 07-Feb-2012 mckusick

In the original days of BSD, a sync was issued on every filesystem
every 30 seconds. This spike in I/O caused the system to pause every
30 seconds which was quite annoying. So, the way that sync worked
was changed so that when a vnode was first dirtied, it was put on
a 30-second cleaning queue (see the syncer_workitem_pending queues
in kern/vfs_subr.c). If the file has not been written or deleted
after 30 seconds, the syncer pushes it out. As the syncer runs once
per second, dirty files are trickled out slowly over the 30-second
period instead of all at once by a call to sync(2).

The one drawback to this is that it does not cover the filesystem
metadata. To handle the metadata, vfs_allocate_syncvnode() is called
to create a "filesystem syncer vnode" at mount time which cycles
around the cleaning queue being sync'ed every 30 seconds. In the
original design, the only things it would sync for UFS were the
filesystem metadata: inode blocks, cylinder group bitmaps, and the
superblock (e.g., by VOP_FSYNC'ing devvp, the device vnode from
which the filesystem is mounted).

Somewhere in its path to integration with FreeBSD the flushing of
the filesystem syncer vnode got changed to sync every vnode associated
with the filesystem. The result of this change was a return to the
old behavior of a filesystem-wide flush every 30 seconds, which made the
whole 30-second delay per vnode useless.

This change goes back to the originally intended trickle out sync
behavior. Key to ensuring that all the intended semantics are
preserved (e.g., that all inode updates get flushed within a bounded
period of time) is that all inode modifications get pushed to their
corresponding inode blocks so that the metadata flush by the
filesystem syncer vnode gets them to the disk in a timely way.
Thanks to Konstantin Belousov (kib@) for doing the audit and commit
-r231122 which ensures that all of these updates are being made.

Reviewed by: kib
Tested by: scottl
MFC after: 2 weeks
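
The trickle-out behavior described in r231160 can be pictured as a timing wheel with one slot per second. The sketch below is only a model of that idea, not the syncer_workitem_pending code: the names are hypothetical, and each slot holds a count rather than a list of vnodes.

```c
#include <assert.h>
#include <stddef.h>

#define EX_WHEEL_SLOTS 30	/* one slot per second of delay */

struct ex_wheel {
	int pending[EX_WHEEL_SLOTS];	/* dirty items due in each slot */
	int now;			/* current second */
};

/* Queue one newly dirtied item to be flushed `delay` seconds from now. */
void
ex_wheel_add(struct ex_wheel *w, int delay)
{
	assert(w != NULL && delay >= 1 && delay <= EX_WHEEL_SLOTS);
	w->pending[(w->now + delay) % EX_WHEEL_SLOTS]++;
}

/* Advance one second and return how many items were flushed in that tick. */
int
ex_wheel_tick(struct ex_wheel *w)
{
	int slot, n;

	assert(w != NULL);
	w->now++;
	slot = w->now % EX_WHEEL_SLOTS;
	n = w->pending[slot];
	w->pending[slot] = 0;
	return (n);
}
```

Because each tick drains a single slot, items dirtied at different times are written out at different times instead of in one burst.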


231122 07-Feb-2012 kib

Sprinkle missed calls to asynchronous UFS_UPDATE() in an attempt to
guarantee that all UFS inode metadata changes result in the dirtiness
of the inodeblock. Due to missed inodeblock updates, the syncer was
required to fsync each mount point's vnode to guarantee periodic
metadata flush.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks


231091 06-Feb-2012 kib

Add missing opt_quota.h include to activate #ifdef QUOTA blocks,
apparently a step in unbreaking QUOTA support.

Reported and tested by: Adam Strohl <adams-freebsd ateamsystems com>
MFC after: 1 week


231077 06-Feb-2012 kib

JNEWBLK dependency may legitimately appear on the buf dependency
list. If softdep_sync_buf() discovers such dependency, it should do
nothing, which is safe as it is only waiting on the parent buffer to
be written, so it can be removed.

Committed on behalf of: jeff
MFC after: 1 week


231075 06-Feb-2012 kib

Current implementations of sync(2) and the syncer vnode fsync() VOP use
the mnt_noasync counter to temporarily remove the MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks
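
The per-thread override described in r231075 can be modeled as a predicate over two flag words. This is a sketch under assumptions: the flag names and values are illustrative, and `ex_doing_async` is a hypothetical function, not the DOINGASYNC() macro itself.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_MNTK_ASYNC	0x1u	/* illustrative: mount allows async i/o */
#define EX_TD_SYNCIO	0x1u	/* illustrative: this thread wants sync i/o */

/*
 * Async i/o is used only when the mount permits it and the current
 * thread has not asked for synchronous completion.
 */
bool
ex_doing_async(unsigned mount_flags, unsigned thread_flags)
{
	return ((mount_flags & EX_MNTK_ASYNC) != 0 &&
	    (thread_flags & EX_TD_SYNCIO) == 0);
}
```

A thread running sync(2) sets its own flag and sees synchronous behavior, while other threads on the same mount continue to issue async writes.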


230250 17-Jan-2012 mckusick

There are several bugs/hangs when trying to take a snapshot on a UFS/FFS
filesystem running with journaled soft updates. Until these problems
have been tracked down, return ENOTSUPP when an attempt is made to
take a snapshot on a filesystem running with journaled soft updates.

MFC after: 2 weeks


230249 17-Jan-2012 mckusick

Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks
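
The truncation that r230249 guards against is easy to demonstrate: a flag above bit 31 disappears when it passes through a 32-bit intermediate. A minimal sketch with a hypothetical flag value and hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

#define EX_FLAG_HIGH (UINT64_C(1) << 40)	/* illustrative flag above bit 31 */

/* Passing flags through a 32-bit intermediate drops the upper bits. */
uint64_t
ex_pass_through_32(uint64_t flags)
{
	uint32_t tmp = (uint32_t)flags;

	return (tmp);
}

/* A 64-bit intermediate preserves every flag. */
uint64_t
ex_pass_through_64(uint64_t flags)
{
	uint64_t tmp = flags;

	return (tmp);
}
```

Compilers often accept the narrowing silently, which is why every intermediate variable and parameter along the call chain has to be widened.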


230221 16-Jan-2012 ivoras

Add a bit of verbosity to the comment.


230101 14-Jan-2012 mckusick

Convert FFS mount error messages from kernel printf's to using the
vfs_mount_error error message facility provided by the nmount
interface.

Clean up formatting of mount warnings which still need to use
kernel printf's since they do not return errors.

Requested by: Craig Rodrigues <rodrigc@crodrigues.org>
MFC after: 2 weeks


229828 08-Jan-2012 kib

Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon().
The vfs_busy() is after covered vnode lock in the global lock order, but
since quotaon() does recursive VFS call to open quota file, we usually
end up locking covered vnode after mp is busied in sys_quotactl().

Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied
by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon,
esp. if vfs_busy cannot succeed due to unmount being performed.

Reported and tested by: pho
MFC after: 1 week


229200 01-Jan-2012 ed

Migrate ufs and ext2fs from skpc() to memcchr().

While there, remove a useless check from the code. memcchr() always
returns characters unequal to 0xff in this case, so inosused[i] ^ 0xff
can never be equal to zero. Also, the fact that memcchr() returns a
pointer instead of the number of bytes until the end makes conversion
to an offset far easier.
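
The pointer-to-offset conversion mentioned in r229200 can be sketched in userland. `ex_memcchr` below is a hypothetical stand-in written for this example (it is not the kernel routine), and `ex_first_nonfull_byte` is likewise an assumed helper name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return a pointer to the first byte in buf[0..n) that differs from c,
 * or NULL if every byte equals c.
 */
const unsigned char *
ex_memcchr(const unsigned char *buf, unsigned char c, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (buf[i] != c)
			return (&buf[i]);
	return (NULL);
}

/*
 * Return the index of the first bitmap byte that is not 0xff (that is,
 * has at least one clear bit), or -1 when the whole map is full.
 */
long
ex_first_nonfull_byte(const unsigned char *map, size_t n)
{
	const unsigned char *p;

	assert(map != NULL);
	p = ex_memcchr(map, 0xff, n);
	return (p == NULL ? -1 : (long)(p - map));
}
```

Since the returned byte is by construction not 0xff, a later test of `byte ^ 0xff` against zero can never succeed, which is why the commit could drop that check.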


227382 09-Nov-2011 gleb

Use implementation independent inoNN_t scalars for on-disk UFS structures

Approved by: mdf (mentor)


227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


227267 06-Nov-2011 ed

Remove MALLOC_DECLAREs of nonexisting malloc-pools.

After careful grepping, it seems none of these pools can be found in our
source tree. They are not in use, nor are they defined.


226971 31-Oct-2011 pho

Fix the wrong commit log message for r226967: "Added missing cache purge
of from argument" and fix the comment.


226967 31-Oct-2011 pho

The kern_renameat() looks up the fvp using the DELETE flag, which causes
the removal of the name cache entry for fvp.

Reported by: Anton Yuzhaninov <citrin citrin ru>
In collaboration with: kib
MFC after: 1 week


225807 27-Sep-2011 mckusick

This update eliminates a lock-order reversal warning discovered
while tracking down the system hang reported in kern/160662 and
corrected in revision 225806. The LOR is not the cause of the system
hang and indeed cannot cause an actual deadlock. However, it can
be easily eliminated by deferring the acquisition of a buflock until
after all the vnode locks have been acquired.

Reported by: Hans Ottevanger
PR: kern/160662


225806 27-Sep-2011 mckusick

This update eliminates the system hang reported in kern/160662 when
taking a snapshot on a filesystem running with journaled soft updates.

Reported by: Hans Ottevanger
Fix verified by: Hans Ottevanger
PR: kern/160662


225700 20-Sep-2011 kib

Use nowait sync request for a vnode when doing softdep cleanup. We may
own an unrelated vnode lock, so doing a waiting sync causes deadlocks.

Reported and tested by: pho
Approved by: re (bz)


225166 25-Aug-2011 mm

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


225104 23-Aug-2011 ae

Fix lock leak.

Reported by: Alex Lyashkov
Approved by: re (kib)
MFC after: 1 week


224876 15-Aug-2011 rwatson

Fix two cases involving opt_capsicum.h and module builds:

(1) opt_capsicum.h is no longer required in ffs_alloc.c, so remove the
#include.

(2) portalfs depends on opt_capsicum.h, so have the Makefile generate one
if required.

These affect only modules built without a kernel (i.e, not buildkernel,
but yes buildworld if the dubious MODULES_WITH_WORLD is used).

Approved by: re (bz)
Sponsored by: Google Inc


224778 11-Aug-2011 rwatson

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


224503 30-Jul-2011 mckusick

Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP
is set so that mount can revert back to using MNT_NOWAIT when doing
getmntinfo.

Approved by: re (kib)


224294 24-Jul-2011 mckusick

Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)


224272 22-Jul-2011 mckusick

Default debugging error messages to off for journaled soft updates sysctls.
Delete limiting on output of these sysctls.

Approved by: re (kib)


224061 15-Jul-2011 mckusick

Add an FFS specific mount option to allow a filesystem checker
(typically fsck_ffs) to register that it wishes to use FFS specific
sysctl's to update the filesystem. This ensures that two checkers
cannot run on a given filesystem at the same time and that no other
process accidentally or maliciously uses the filesystem updating
sysctls inappropriately. This functionality is needed by the
journaling soft-updates recovery code.


224027 14-Jul-2011 mckusick

Consistently check mount flag (MNTK_SUJ) rather than superblock
flag (FS_SUJ) when determining whether to do journaling-based
operations. The mount flag is set only when journaling is active
while the superblock flag is set to indicate that journaling is to
be used. For example, when the filesystem is mounted read-only, the
journaling may be present (FS_SUJ) but not active (MNTK_SUJ).
Inappropriate checking of the FS_SUJ flag was causing some
journaling actions to be attempted at inappropriate times.


223902 10-Jul-2011 mckusick

When first creating snapshots, we may free some blocks within it.
These blocks should not have TRIM applied to them.

Submitted by: Kostik Belousov


223900 10-Jul-2011 mckusick

Allow disk partitions associated with UFS read-only mounted
filesystems to be opened for writing. This functionality used to
be special-cased for just the root filesystem, but with this change
is now available for all UFS filesystems. This change is needed for
journaled soft updates recovery.

Discussed with: Jeff Roberson


223888 09-Jul-2011 kib

Use 'curthread_pflags' instead of 'thread_pflags' to signify that only
curthread can be operated upon.

Requested by: attilio
MFC after: 1 week


223887 09-Jul-2011 kib

Use helper functions instead of manually managing TDP_INBDFLUSH.

Sponsored by: The FreeBSD Foundation
Reviewed by: alc (previous version)
MFC after: 1 week


223772 04-Jul-2011 jeff

- Speed up pendingblock processing again. Having too much delay between
ffs_blkfree() and the pending adjustment causes all kinds of
space related problems.


223771 04-Jul-2011 jeff

- Handle D_JSEGDEP in the softdep_sync_buf() switch. These can now
find themselves on snapshot vnodes.

Reported by: pho


223770 04-Jul-2011 jeff

- It is impossible to run request_cleanup() while doing a copyonwrite.
This will most likely cause new block allocations which can recurse
into request cleanup.
- While here optimize the ufs locking slightly. We need only acquire and
drop once.
- process_removes() and process_truncates() are also only needed once.
- Attempt to flush each item on the worklist once but do not loop forever
if some can not be completed.

Discussed with: mckusick


223769 04-Jul-2011 jeff

- Fix an inode quota leak. We need to decrement the quota once and only
once.

Tested by: pho
Reviewed by: mckusick


223687 29-Jun-2011 mckusick

Handle the FREEDEP case in softdep_sync_buf().
This fix failed to get added in -r223325.

Submitted by: Peter Holm


223677 29-Jun-2011 alc

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


223325 20-Jun-2011 jeff

- Fix directory count rollbacks by passing the mode to the journal dep
earlier.
- Add rollback/forward code for frag and cluster accounting.
- Handle the FREEDEP case in softdep_sync_buf(). (submitted by pho)


223268 18-Jun-2011 mckusick

Fixed dereference of a NULL pointer.

Reported by: Peter Holm


223169 16-Jun-2011 mckusick

Drop the include of <ufs/ffs/ffs_extern.h> from usr.sbin/makefs/ffs/ffs_bswap.c
and usr.sbin/makefs/ffs/ffs_subr.c as they have no need of anything in that
file. No other programs or libraries include <ufs/ffs/ffs_extern.h> (nor
should they as it is totally in-kernel interfaces). For added protection
I enclosed the entire contents of <ufs/ffs/ffs_extern.h> in ifdef _KERNEL.

Feedback from: Bruce Evans and Tai-hwa Liang


223138 16-Jun-2011 avatar

Fixing compilation bustage by introducing another forward declaration.


223127 15-Jun-2011 mckusick

Ensure that filesystem metadata contained within persistent snapshots
is always kept consistent.

Suggested by: Jeff Roberson


223114 15-Jun-2011 mckusick

With the restructuring of the block reclamation code, the notification
messages for a filesystem being out of space need to be moved so that
they do not print out until after a failed cleanup attempt.

Suggested by: Jeff Roberson


223105 15-Jun-2011 mckusick

Missing cleanup case after completion of a snapshot vnode write
claiming a released block.

Submitted by: Jeff Roberson
Tested by: Peter Holm


223052 13-Jun-2011 dim

Use alternative, less messy solution to avoid breakage after r223020:
put the snapdata structure between #ifdef _KERNEL guards.

Suggested by: kib


223020 12-Jun-2011 mckusick

Update to soft updates journaling to properly track freed blocks
that get claimed by snapshots.

Submitted by: Jeff Roberson
Tested by: Peter Holm


223018 12-Jun-2011 mckusick

Disable the soft updates journaling after a filesystem is successfully
downgraded to read-only. It will be restarted if the filesystem is
upgraded back to read-write.


222958 10-Jun-2011 jeff

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


222955 10-Jun-2011 jeff

- Add support for referencing quota structures without needing the inode
pointer for softupdates.

Submitted by: mckusick


222954 10-Jun-2011 jeff

- If the fsync in ufs_direnter fails SUJ can later panic because we have
partially added a name. Allow ufs_direnter() to continue in the
hopes that it is a transient error. If it is not, the directory
is corrupted already from IO errors and writing this new block
is not likely to make things worse.


222724 05-Jun-2011 mckusick

Grammar fix in comment.

Eliminate one (of several) possible conflicting buffer locks when
trying to reclaim blocks. Rest of fix to be incorporated as part
of SUJ update by jeff.

Pointed out by: Kostik Belousov


222422 28-May-2011 mckusick

Due to a lag in updating the fs_pendinginodes count, we cannot depend
on it to decide whether we should try to reclaim inodes when we run
short.

Discovered by: Peter Holm


222334 26-May-2011 mckusick

The check for whether a block is going to be claimed by a snapshot
needs to happen before we notify the underlying layer that it is
being freed.


222196 22-May-2011 rmacklem

Fix the ufs/ffs file system so that it uses the lock
flags argument added to VFS_FHTOVP() by r222167.

Reviewed by: mckusick


222167 22-May-2011 rmacklem

Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by: kib


221829 13-May-2011 mdf

Use a name instead of a magic number for kern_yield(9) when the priority
should not change. Fetch the td_user_pri under the thread lock. This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >


221281 30-Apr-2011 kib

Fix typos.

Noted by: Fabian Keil <freebsd-listen fabiankeil de>
Pointy hat to: kib
MFC after: 1 week


221261 30-Apr-2011 kib

Clarify the comment.

MFC after: 1 week


220985 24-Apr-2011 kib

VFS sometimes is unable to inactivate a vnode when vnode use count
goes to zero. E.g., the vnode might be only shared-locked at the time of
vput() call. Such vnodes are kept in the hash, so they can be found later.

If ffs_valloc() allocated an inode that has its vnode cached in hash, and
still owing the inactivation, then vget() call from ffs_valloc() clears
VI_OWEINACT, and then the vnode is reused for the newly allocated inode.

The problem is, the vnode is not reclaimed before it is put to the new
use. ffs_valloc() recycles vnode vm object, but this is not enough.
In particular, at least v_vflag should be cleared, and several bits of
UFS state need to be removed.

It is very inconvenient to call vgone() at this point. Instead, move
some parts of ufs_reclaim() into helper function ufs_prepare_reclaim(),
and call the helper from VOP_RECLAIM and ffs_valloc().

Reviewed by: mckusick
Tested by: pho
MFC after: 3 weeks


220532 11-Apr-2011 jeff

- Refactor softdep_setup_freeblocks() into a set of functions to prepare
for a new journal specific partial truncate routine.
- Use dep_current[] in place of specific dependency counts. This is
automatically maintained when workitems are allocated and has
less risk of becoming incorrect.


220511 10-Apr-2011 jeff

Fix a long standing SUJ performance problem:

- Keep a hash of indirect blocks that have recently been freed and are
still referenced in the journal.
- Lookup blocks in this hash before forcing a new block write to wait on
the journal entry to hit the disk. This is only necessary to avoid
confusion between old identities as indirects and new identities as
file blocks.
- Don't free jseg structures until the journal has written a record that
invalidates it. This keeps the indirect block information around for
as long as is required to be safe.
- Force an empty journal block write when required to flush out stale
journal data that is simply waiting for the oldest valid sequence
number to advance beyond it.


220406 07-Apr-2011 jeff

- Don't invalidate jnewblks immediately upon discovering that the block
will be removed. Permit the journal to proceed so that we don't leave
a rollback in a cg for a very long time as this can cause terrible perf
problems in low memory situations.

Tested by: pho


220374 05-Apr-2011 mckusick

Be far more persistent in reclaiming blocks and inodes before giving
up and declaring a filesystem out of space. Especially necessary when
running on a small filesystem. With this improvement, it should be
possible to use soft updates on a small root filesystem.

Kudos to: Peter Holm
Testing by: Peter Holm
MFC: 2 weeks


220282 02-Apr-2011 jeff

Fix problems that manifested from filesystem full conditions:

- In softdep_revert_mkdir() find the dotaddref before we attempt to cancel
the jaddref so we can make assumptions about where the dotaddref is on
the list. cancel_jaddref() does not always remove items from the list
anymore.
- Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE
was never set. This ensures that dependencies will continue to be
processed on the inowait/bufwait list and is more an artifact of
the structure of the code than a pure ordering problem.
- Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed
appropriately. This normally occurs when the refs are added to the
journal but if they are canceled before this point the state would
never be set and the dependency could never be freed.

Reported by: pho
Tested by: pho


220099 28-Mar-2011 kib

Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case.

Submitted by: Aleksandr Rybalko <ray dlink ua>


219895 23-Mar-2011 mckusick

Add retry code analogous to the block allocation retry code
to avoid running out of inodes.

Reported by: Peter Holm


219804 20-Mar-2011 kib

Retire opt_ffs_broken_fixme.h.
Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with
usual layering.

Requested by: bde
MFC after: 1 week


219712 17-Mar-2011 kib

Remove the #if defined(FFS) || defined(IFS) braces around the calls to
ffs_snapgone(). The ufs.ko module is not built with the FFS define, causing
snapshot inode number slots in the superblock to never be freed, and a
reference on the snapshot vnode to never be released.

IFS was removed several years ago, and UFS/FFS separation was not
maintained for real.

Reported, analyzed and tested by: Yamagi Burmeister <lists yamagi org>
MFC after: 3 days


219388 07-Mar-2011 kib

Simplify uses of the web of pointers.

Reviewed by: mckusick
MFC after: 1 week


219384 07-Mar-2011 jhb

The UFS dirhash code was attempting to update shared state in the dirhash
from multiple threads while holding a shared lock during a lookup operation.
This could result in incorrect ENOENT failures which could then be
permanently stored in the name cache.

Specifically, the dirhash code optimizes the case that a single thread is
walking a directory sequentially opening (or stat'ing) each file. It uses
state in the dirhash structure to determine if a given lookup is using the
optimization. If the optimization fails, it disables it and restarts the
lookup. The problem arises when two threads both attempt the optimization
and fail. The first thread will restart the loop, but the second thread
will incorrectly think that it did not try the optimization and will only
examine a subset of the directory entries in its hash chain. As a result,
it may fail to find its directory entry and incorrectly fail with ENOENT.

To make this safe for use with shared locks, simplify the state stored in
the dirhash and move some of the state (the part that determines if the
current thread is trying the optimization) into a local variable. One
result is that we will now try the optimization more often. We still
update the value under the shared lock, but it is a single atomic store
similar to i_diroff that is stored in UFS directory i-nodes for the
non-dirhash lookup.

Reviewed by: kib
MFC after: 1 week


219276 04-Mar-2011 jhb

Use ffs() to locate free bits in the inode bitmap rather than a loop with
bit shifts.

Reviewed by: mckusick
MFC after: 1 month
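
A small userland sketch of the idea in r219276, contrasting a shift loop
with ffs(3) from <strings.h>. It assumes a bitmap byte in which a set bit
means "in use"; the helper names first_clear_bit() and
first_clear_bit_loop() are hypothetical and do not come from the tree.

```c
#include <assert.h>
#include <strings.h>	/* ffs() */

/* Loop form: test each bit in turn until a clear one is found. */
static int
first_clear_bit_loop(unsigned char map_byte)
{
	int i;

	for (i = 0; i < 8; i++)
		if ((map_byte & (1 << i)) == 0)
			return (i);
	return (-1);
}

/*
 * ffs() form: invert the byte so clear bits become set, then let ffs()
 * find the lowest set bit. ffs() is 1-based and returns 0 when no bit is
 * set, so subtracting 1 yields 0-7 on success and -1 for a full byte.
 */
static int
first_clear_bit(unsigned char map_byte)
{
	int bit;

	bit = ffs(~map_byte & 0xff) - 1;
	assert(bit >= -1 && bit < 8);
	return (bit);
}
```

Both helpers return 3 for the byte 0x07 and -1 for 0xff; the ffs() form
simply avoids the explicit loop.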


218838 19-Feb-2011 kib

v_mountedhere is a member of the union. Check that the vnodes have
proper type before using the member.

Reported and tested by: Michael Butler <imb protected-networks net>


218602 12-Feb-2011 kib

Use the native sector size of the device backing the UFS volume for SU+J
journal blocks, instead of hard coding 512 byte sector size. The journal needs
to write each block atomically, which can only be guaranteed at the device
sector size, not larger. Attempts to write less than the sector size result in
driver errors.

Note that this is the first structure in UFS that depends on the
sector size. Other elements are written in the units of fragments.

In collaboration with: pho
Reviewed by: jeff
Tested by: bz, pho


218513 10-Feb-2011 netchild

Wrap long line.

Noticed by: bz


218485 09-Feb-2011 netchild

Add some FEATURE macros for some UFS features.

SU+J is not included as a FEATURE macro:
- it was not in the tree during the GSoC
- I do not see an option to en-/disable it in NOTES

Two minor changes were made during the review compared to what was developed
during GSoC 2010.

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availability if
needed.

Sponsored by: Google Summer of Code 2010
Submitted by: kibab
Reviewed by: kib
X-MFC after: to be determined in last commit with code from this project


218424 08-Feb-2011 mdf

Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yield() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.


218195 02-Feb-2011 mdf

Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week


217357 13-Jan-2011 pluknet

Embed a quota error message (C string) into uprintf() fmt.
While here, fix whitespaces.

Approved by: kib (mentor)


217326 12-Jan-2011 mdf

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the kernel changes.


216951 04-Jan-2011 kib

Instead of incrementing freework reference counter in indir_trunc(), do
it at the allocation time for journaled fs and indirect blocks, when
the allocated object is not accessible outside.

Requested and reviewed by: jeff
Tested by: pho


216818 30-Dec-2010 kib

Handle missing jremrefs when a directory is renamed overtop of
another, deleting it. If the directory is removed, UFS always needs to
remove the .. ref, even if the ultimate ref on the parent would not
change. The new directory must have a new journal entry for that ref.
Otherwise journal processing would not properly account for the
parent's reference since it will belong to a removed directory entry.

Change ufs_rename()'s dotdot rename section to always
setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency
allocated via setup_dotdot_link().

Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove().
Remove the isdirrem > 1 checks from newdirrem().

Reported by: many
Submitted by: jeff
Tested by: pho


216817 30-Dec-2010 kib

In indir_trunc(), when processing jnewblk entries that are not written
to the disk, recurse to handle indirect blocks of next level that are
hidden by the corresponding entry.

In collaboration with: pho
Reviewed by: jeff, mckusick
Tested by: mckusick, pho


216796 29-Dec-2010 kib

Add kernel side support for BIO_DELETE/TRIM on UFS.

The FS_TRIM fs flag indicates that administrator requested issuing of
TRIM commands for the volume. UFS will only send the command to disk
if the disk reports GEOM::candelete attribute.

Since disk queue is reordered, data block is marked as free in the bitmap
only after TRIM command completed. Due to need to sleep waiting for
i/o to finish, TRIM bio_done routine schedules taskqueue to set the
bitmap bit.

Based on the patch by: mckusick
Reviewed by: mckusick, pjd
Tested by: pho
MFC after: 1 month


216795 29-Dec-2010 kib

Move the definition of mkdirlisthd from header to C file.

Reviewed by: mckusick
Tested by: pho


216792 29-Dec-2010 kib

Use a proper type for the variable holding the summary size of the inode
data. Otherwise, on 32bit systems, an unlinked inode whose size is a
multiple of 4GB was not truncated, causing corruption.

Reported by: brucec
Reviewed by: mckusick
Tested by: pho
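
The failure mode in r216792 can be sketched in userland as follows. This
assumes the old variable was 32 bits wide on the affected systems, which
the log does not state explicitly; needs_truncate_narrow() and
needs_truncate_wide() are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy pattern (assumed): the 64-bit size is narrowed to 32 bits before
 * it is tested, so any exact multiple of 4GB looks like a size of zero.
 */
static int
needs_truncate_narrow(uint64_t inode_size)
{
	uint32_t isize;

	assert(sizeof(isize) < sizeof(inode_size));
	isize = (uint32_t)inode_size;	/* high 32 bits are discarded */
	return (isize != 0);
}

/* Fixed pattern: keep the full 64-bit width for the comparison. */
static int
needs_truncate_wide(uint64_t inode_size)
{
	uint64_t isize;

	isize = inode_size;
	return (isize != 0);
}
```

With a size of exactly 4GB (1 << 32), the narrow version reports that no
truncation is needed while the wide version reports that it is.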


216676 23-Dec-2010 mckusick

This patch fixes a soft update panic while running perl 5.12 tests
which produced:

panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164

Reported by: Dimitry Andric
Fixed by: Jeff Roberson


216099 01-Dec-2010 kib

Journal start looks up .sujournal file by doing lookup on the root dvp.
As a result, a failed softdep_mount() might leave up to two vnodes on the
mp mountlist, preventing mnt_ref from going to zero.

Call ffs_flushfiles() after failed softdep_mount() to clean mountlist.

Initial report by: Garrett Cooper
Reproduced and tested by: pho


215950 27-Nov-2010 pho

First step in fixing the handle_workitem_freeblocks panic.

In collaboration with: kib


215576 20-Nov-2010 mckusick

Delete /sys/ufs/ffs/README.snapshot as it is no longer relevant.
Drop reference to it in mount(8).

MFC: 3 days


215548 19-Nov-2010 kib

Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seem to be unused, and the reported
situation is normal for the forced unmount.

MFC after: 1 week
X-MFC-note: keep prtactive symbol in vfs_subr.c


215117 11-Nov-2010 kib

The softdep_setup_freeblocks() adds worklist items before
deallocate_dependencies() is done. This opens a race between softdep
thread and the thread that does the truncation:
A write of the indirect block causes the freeblks to become
ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And
then, softdep_disk_write_complete() would reassign the workitem to the
mount point worklist, causing premature processing of the workitem, or
journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does
the same reassign.
indir_trunc() then would find the indirect block that is locked (with lock
owned by kernel) but without any dependencies, causing it to hang in
getblk() waiting for buffer lock.

Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies()
finished.

Analyzed, suggested and reviewed by: jeff
Tested by: pho


215115 11-Nov-2010 kib

Change #ifdef INVARIANTS panic into KASSERT, and print some useful
information to diagnose the issue, in handle_complete_freeblocks().

Reviewed by: jeff
Tested by: pho


215114 11-Nov-2010 kib

In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped.
I believe there is a window otherwise where jblocks can be accessed
without proper initialization.

Reviewed by: jeff
Tested by: pho


215113 11-Nov-2010 kib

Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by: jeff
Tested by: pho
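
One way to picture such a helper, as a sketch only: assume the offset for a
given level is the number of logical blocks spanned by a single pointer at
that level, i.e. the pointers-per-indirect-block count raised to the level.
That interpretation, the name sketch_lbn_offset(), and the nindir parameter
are assumptions made for illustration; the log itself gives no formula.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: return nindir^level, where nindir is the assumed
 * number of block pointers held by one indirect block. Level 0 yields 1.
 */
static int64_t
sketch_lbn_offset(int64_t nindir, int level)
{
	int64_t res;

	assert(nindir > 0 && level >= 0);
	for (res = 1; level > 0; level--)
		res *= nindir;
	return (res);
}
```

With 4096 pointers per indirect block, the sketch gives 1 for level 0,
4096 for level 1 and 16777216 for level 2.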


215112 11-Nov-2010 kib

Fix typo. Function is called ffs_blkfree.


215052 09-Nov-2010 jhb

Remove unused includes of <sys/mutex.h> and <machine/mutex.h>.


214359 25-Oct-2010 ivoras

Bring vfs.ufs.dirhash_maxmem into the age of the fruitbat and make it
autotuned. It is only an upper bound (the memory is not always allocated)
and the system contains a vm_lowmem handler so nothing will crash and burn
if it's tuned too high.

Reviewed by: mckusick


213664 10-Oct-2010 kib

The r184588 changed the layout of struct export_args, causing an ABI
breakage for old mount(2) syscall, since most struct <filesystem>_args
embed export_args. The mount(2) is supposed to provide ABI
compatibility for pre-nmount mount(8) binaries, so restore ABI to
pre-r184588.

Requested and reviewed by: bde
MFC after: 2 weeks


213363 02-Oct-2010 alc

M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that
have no run-time effect.


213275 29-Sep-2010 mckusick

Since local variable 'i' is used only in a KASSERT, declare and
initialize it only if INVARIANTS is defined to avoid a declared
but unused warning.

Suggested by: Brian Somers <brian@FreeBSD.org>


213259 29-Sep-2010 kib

Fix typo in comment.


212788 17-Sep-2010 obrien

Correct some non-code typos.


212617 14-Sep-2010 mckusick

Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by: Jeff Roberson


211531 20-Aug-2010 jhb

Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by: kib
MFC after: 3 days


211212 12-Aug-2010 kib

Softdep_process_worklist() should unsuspend not only before processing
the worklist (in softdep_process_journal), but also after flushing the
workitems. Might be, we should even do this before bwillwrite() too, but
this seems to be not needed for now.

Fs might be suspended during processing the queue, and then there is
nobody around to unsuspend.

In collaboration with: pho
Tested by: bz
Reviewed by: jeff


210172 16-Jul-2010 jhb

Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with: attilio


210171 16-Jul-2010 jhb

When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO. This occurs after the vnode has been inserted into a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock. If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result
the thread blocked on the vnode lock may never get woken up. Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after: 3 days


209717 06-Jul-2010 jeff

- Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count. Previously
all truncation as a result of unlink happened in the softdep flush
thread. This had the effect of being impossible to rate limit properly
with the journal code. Now the process issuing unlinks is suspended
when the journal fills. This has a side-effect of improving rm
performance by allowing more concurrent work.
- Handle two cases in inactive, one for effnlink == 0 and another when
nlink finally reaches 0.
- Eliminate the SPACECOUNTED related code since the truncation is no
longer delayed.

Discussed with: mckusick


209367 20-Jun-2010 kib

Ensure that VOP_ACCESSX is called with exclusively locked vnode for
the kernel compiled with QUOTA option. ufs_accessx() upgrades the vdp
vnode lock from shared to exclusive to assign the dquot structure to
the vnode, and ufs_delete_denied() is called when tvp is locked. Since
upgrade drops shared lock when non-blocked upgrade failed, LOR is there.

Reported and tested by: Dmitry Pryanishnikov <lynx.ripe gmail com>
Tested by: pho
PR: kern/147890
MFC after: 1 week


209057 11-Jun-2010 avg

ffs_softdep: change K&R in function definitions to ANSI prototypes

Apparently it's bad when we first have an ANSI prototype in function
declaration, but then use K&R in its definition.

Complaint from: clang
MFC after: 2 weeks


208774 03-Jun-2010 kib

Extend the scope of the lock on the quota file vnode in quotaon() to
cover the initial read by dqopen(). Assert that vnode is locked in
dqopen(). Remove VFS_LOCK_GIANT() from dqopen(), since quotaon() keeps
Giant locked if needed around the call.


208293 19-May-2010 avg

ffs_mount: accept and drop userland-only options that can be passed from
loader(8)

In r193192 loader(8) has grown an ability to pass root mount options
from fstab via vfs.root.mountfrom.options. Unfortunately, some options
that can be present in fstab are for userland only and lead to root
mounting failure when seen by kernel.
Rather than teaching loader about FFS-specific options that should be
filtered out, ffs_mount recognizes those options as valid, but ignores
and deletes[1] them.

[1] is suggested by jh.

PR: kern/141050
Reported by: many
Reviewed by: jh, bde
MFC after: 4 days


208287 19-May-2010 jeff

- Don't immediately re-run softdepflush if we didn't make any progress
on the last iteration. This can lead to a deadlock when we have
worklist items that cannot be immediately satisfied.

Reported by: uqs, Dimitry Andric <dimitry@andric.com>

- Remove some unnecessary debugging code and place some other under
SUJ_DEBUG.
- Examine the journal state in softdep_slowdown().
- Re-format some comments so I may more easily add flag descriptions.


207742 07-May-2010 jeff

- Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
- Don't fsync() vnodes in prealloc if copy on write is in progress. It
is not safe to recurse back into the write path here.

Reported by: Vladimir Grebenschikov <vova@fbsd.ru>


207741 07-May-2010 jeff

- Use the correct flag mask when determining whether an inode has
successfully made it to the free list yet or not. This fixes
a deadlock that can occur with unlinked but referenced files.
Journal space and inodedeps were not correctly reclaimed because
the inode block was not left dirty.

Tested/Reported by: lwindschuh@googlemail.com


207736 07-May-2010 mckusick

Merger of the quota64 project into head.

This joint work of Dag-Erling Smørgrav and myself updates the
FFS quota system to support both traditional 32-bit and new 64-bit
quotas (for those of you who want to put 2+Tb quotas on your users).

By default quotas are not compiled into the kernel. To include them
in your kernel configuration you need to specify:

options QUOTA # Enable FFS quotas

If you are already running with the current 32-bit quotas, they
should continue to work just as they have in the past. If you
wish to convert to using 64-bit quotas, use `quotacheck -c 64';
if you wish to revert from 64-bit quotas back to 32-bit quotas,
use `quotacheck -c 32'.

There is a new library of functions to simplify the use of the
quota system, do `man quotafile' for details. If your application
is currently using the quotactl(2), it is highly recommended that
you convert your application to use the quotafile interface.
Note that existing binaries will continue to work.

Special thanks to John Kozubik of rsync.net for getting me
interested in pursuing 64-bit quota support and for funding
part of my development time on this project.


207728 06-May-2010 alc

Eliminate page queues locking around most calls to vm_page_free().


207669 05-May-2010 alc

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


207662 05-May-2010 trasz

Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().

Reviewed by: kib


207366 29-Apr-2010 avg

ffs_vfsops: restore alphabetic order of options in ffs_opts

The order was not correct only for nfsv4acls.
("no" prefix is ignored)

MFC after: 1 week


207310 28-Apr-2010 jeff

- When canceling jaddrefs they may not yet be in the journal if this is via
a revert call. In this case don't attempt to remove something that
has not yet been added. Otherwise this jaddref must hang around
to prevent the bitmap write as normal.


207309 28-Apr-2010 jeff

- Fix builds without SOFTUPDATES defined in the kernel config.


207142 24-Apr-2010 pjd

Fix build for UFS without SOFTUPDATES.


207141 24-Apr-2010 jeff

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


206894 20-Apr-2010 kib

The cache_enter(9) function shall not be called for doomed dvp.
Assert this.

In the reported panic, vdestroy() fired the assertion "vp has namecache
for ..", because pseudofs may end up doing cache_enter() with reclaimed
dvp, after dotdot lookup temporarily unlocked dvp.
Similar problem exists in ufs_lookup() for "." lookup, when vnode
lock needs to be upgraded.

Verify that dvp is not reclaimed before calling cache_enter().

Reported and tested by: pho
Reviewed by: kan
MFC after: 2 weeks


206128 03-Apr-2010 avg

ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj

The assignment is already done in g_vfs_open.
Redundant assignment is harmless, but can become a problem if g_vfs_open
logic is changed.

MFC after: 1 week


203818 13-Feb-2010 kib

When ffs_realloccg() failed to allocate bigger fragment and, because
pending blocks are scheduled for removal, goes to retry the (re)allocation,
clear the bp pointer. It might happen that meantime free space is really
exhausted and we are entering nospace: label without bread()ing buffer,
causing stale bp value to be brelse()d again.

Tested by: pho
(Producing a scenario to reliably reproduce the
race appeared to be much harder than fixing the bug)
MFC after: 1 week


203784 11-Feb-2010 mckusick

One last pass to get all the unsigned comparisons correct.


203763 10-Feb-2010 mckusick

This fix corrects a problem in the file system that treats large
inode numbers as negative rather than unsigned. For a default
(16K block) file system, this bug began to show up at a file system
size above about 16Tb.

To fully handle this problem, newfs must be updated to ensure that
it will never create a filesystem with more than 2^32 inodes. That
patch will be forthcoming soon.

Reported by: Scott Burns, John Kilburg, Bruce Evans
Followup by: Jeff Roberson
PR: 133980
MFC after: 2 weeks
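
Illustration (not part of the commit): a minimal userland sketch of the
signedness problem described above, assuming a 32-bit inode number. The
helper names are hypothetical and do not appear in the FFS sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy pattern: the inode number is held in a signed 32-bit type, so any
 * value with the top bit set is seen as negative and fails the range check.
 */
static int
ino_valid_signed(int32_t ino, uint32_t ninodes)
{
    assert(ninodes > 0);
    return (ino >= 0 && (uint32_t)ino < ninodes);
}

/* Corrected pattern: compare as unsigned throughout. */
static int
ino_valid_unsigned(uint32_t ino, uint32_t ninodes)
{
    assert(ninodes > 0);
    return (ino < ninodes);
}
```

For a file system with more than 2^31 inodes, the signed variant rejects
every inode number whose top bit is set, while the unsigned variant
accepts it.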


203761 10-Feb-2010 trasz

Remove unused variable.


202971 25-Jan-2010 trasz

Return proper error code.

Found with: clang


202934 24-Jan-2010 trasz

Move out code that does POSIX.1e ACL inheritance into separate routines.

Reviewed by: rwatson


202125 11-Jan-2010 mckusick

Cast 64-bit quantity to intptr_t rather than int so as to work properly
with 64-bit architectures (such as amd64).

Reported by: bz


202113 11-Jan-2010 mckusick

Background:

When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate privilege.

Reported by: jeff
Security issues: rwatson


201758 07-Jan-2010 mbr

Remove extraneous semicolons, no functional changes.

Submitted by: Marc Balmer <marc@msys.ch>
MFC after: 1 week


201717 07-Jan-2010 mckusick

KASSERT that condition raised by Coverity cannot happen.

Found by: Coverity Prevent (tm)
KASSERT by: sam


200796 21-Dec-2009 trasz

Implement NFSv4 ACL support for UFS.

Reviewed by: rwatson


200770 21-Dec-2009 kib

VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing redundant information, the need to update both
vnode and object flags causes more acquisitions of the vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


197408 22-Sep-2009 rdivacky

Don't build ufs_gjournal.c at all if UFS_GJOURNAL option is not given
instead of building an almost empty C file.

Approved by: pjd
Approved by: ed (mentor, implicit)


197269 17-Sep-2009 brooks

Allocate space for the group array in a static credential used in
the quota code. One case was correctly handled in r194498, but
this one was missed.

PR: kern/138657
Tested by: PR submitter
MFC after: 3 days


196987 08-Sep-2009 trasz

Remove useless variable assignment.


196920 07-Sep-2009 kib

insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.

Noted and reviewed by: tegge
Approved by: des (pseudofs part)
MFC after: 3 days


196888 06-Sep-2009 kib

The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Noted and reviewed by: tegge
Tested by: pho
MFC after: 1 week


196206 14-Aug-2009 kib

When a UFS node is truncated to the zero length, e.g. by explicit
truncate(2) call, or by being removed or truncated on open, either
new softupdate freeblks structure is allocated to track the freed
blocks of the node, or truncation is done synchronously when too many SU
dependencies are accumulated. The decision does not take into account
the allocated freeblks dependencies, allowing workloads that do huge
amount of truncations to exhaust the kernel memory.

Take the number of allocated freeblks into consideration for
softdep_slowdown().

Reported by: pluknet gmail com
Diagnosed and tested by: pho
Approved by: re (rwatson)
MFC after: 1 month


195296 02-Jul-2009 trasz

Fix fpathconf(3) on fifos, in effect making ls(1) properly
display '+' on them. Taken from kern/125613, with cosmetic
changes.

PR: kern/125613
Submitted by: Jaakko Heinonen <jh at saunalahti dot fi>
Approved by: re (kib)


195294 02-Jul-2009 kib

In vn_vget_ino() and its inline equivalents, mnt_ref() the mount point
around the sequence that drops the vnode lock and then busies the mount point.
Not having a locked vnode or a direct reference to the mp allows a
forced unmount to proceed, leaving the mp unmounted or reused.

Tested by: pho
Reviewed by: jeff
Approved by: re (kensmith)
MFC after: 2 weeks


195265 01-Jul-2009 trasz

Don't panic on attempt to set ACL on a block device file.
This is just a part of kern/125613.

PR: kern/125613
Submitted by: Jaakko Heinonen <jh at saunalahti dot fi>
Reviewed by: rwatson
Approved by: re (kib)


195187 30-Jun-2009 kib

For SU mounts, softdep_fsync() might drop vnode lock, allowing other
threads to put dirty buffers on the vnode bufobj list. For regular files
and synchronous fsync requests, check for the condition and restart the
fsync vop if a new dirty buffer arrived.

Tested by: pho
Approved by: re (kensmith)
MFC after: 1 month


195186 30-Jun-2009 kib

Softdep_fsync() may need to lock parent directory of the synced vnode.
Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent
mountpoint from being unmounted and freed while no vnodes are locked.

Tested by: pho
Approved by: re (kensmith)
MFC after: 1 month


195003 25-Jun-2009 snb

Fix a bug reported by pho@ where one can induce a panic by decreasing
vfs.ufs.dirhash_maxmem below the current amount of memory used by dirhash. When
ufsdirhash_build() is called with the memory in use greater than dirhash_maxmem,
it attempts to free up memory by calling ufsdirhash_recycle(). If successful in
freeing enough memory, ufsdirhash_recycle() leaves the dirhash list locked. But
at this point in ufsdirhash_build(), the list is not explicitly unlocked after
the call(s) to ufsdirhash_recycle(). When we next attempt to lock the dirhash
list, we will get a "panic: _mtx_lock_sleep: recursed on non-recursive mutex
dirhash list".

Tested by: pho
Approved by: dwmalone (mentor)
MFC after: 3 weeks
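
Illustration (not part of the commit): a toy userland model of the locking
contract described above. struct toy_mtx and the toy_* functions are
hypothetical stand-ins, not the kernel's mutex(9) or dirhash code; the
assert() in toy_lock() plays the role of the "recursed on non-recursive
mutex" panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a non-recursive mutex. */
struct toy_mtx {
    bool held;
};

static void
toy_lock(struct toy_mtx *m)
{
    assert(!m->held);   /* recursing here models the reported panic */
    m->held = true;
}

static void
toy_unlock(struct toy_mtx *m)
{
    assert(m->held);
    m->held = false;
}

static struct toy_mtx list_mtx;
static size_t mem_used;     /* memory currently charged to hashes */
static size_t reclaimable;  /* portion that recycling can free */

/* Returns 0 with list_mtx HELD on success, -1 with it released on failure. */
static int
toy_recycle(size_t limit)
{
    toy_lock(&list_mtx);
    mem_used -= reclaimable;
    reclaimable = 0;
    if (mem_used > limit) {
        toy_unlock(&list_mtx);
        return (-1);
    }
    return (0);
}

static int
toy_build(size_t limit)
{
    if (mem_used > limit) {
        if (toy_recycle(limit) != 0)
            return (-1);
        /* The fix: recycling succeeded and left the list locked. */
        toy_unlock(&list_mtx);
    }
    toy_lock(&list_mtx);    /* would recurse without the unlock above */
    mem_used += 1;          /* account for the newly built hash */
    toy_unlock(&list_mtx);
    return (0);
}
```

Removing the toy_unlock() call after a successful toy_recycle() makes the
next toy_lock() trip its assertion, which mirrors the failure mode in the
bug report.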


194498 19-Jun-2009 brooks

Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise them to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care of the details of allocating
the groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for ensuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.

Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.

Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867
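
Illustration (not part of the commit): a toy sketch of a credential whose
group list is a pointer rather than a fixed-size array, with a setter that
truncates instead of failing. struct toy_cred, toy_setgroups() and the
small TOY_NGROUPS limit are hypothetical; the real interfaces are
crcopysafe() and crsetgroups(), whose signatures are not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NGROUPS 4   /* deliberately tiny limit for the example */

typedef unsigned int toy_gid_t;

struct toy_cred {
    toy_gid_t *groups;  /* heap-allocated; formerly a fixed-size array */
    int ngroups;
};

/* Replace the group list, truncating to TOY_NGROUPS entries if needed. */
static int
toy_setgroups(struct toy_cred *cr, int n, const toy_gid_t *src)
{
    toy_gid_t *g;

    assert(n > 0 && src != NULL);
    if (n > TOY_NGROUPS)
        n = TOY_NGROUPS;    /* truncate rather than return an error */
    g = malloc((size_t)n * sizeof(*g));
    if (g == NULL)
        return (-1);
    memcpy(g, src, (size_t)n * sizeof(*g));
    free(cr->groups);       /* free(NULL) is a no-op on first use */
    cr->groups = g;
    cr->ngroups = n;
    return (0);
}
```

Because callers only see the pointer and the count, the limit can change
without altering the layout that code outside the setter depends on.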


194387 17-Jun-2009 snb

Keep dirhash tailq locked throughout the entirety of ufsdirhash_destroy() to fix
a potential race pointed out by pjd. Also use TAILQ_FOREACH_SAFE to iterate over
dirhashes in ufsdirhash_lowmem(), so that we can continue iterating even after a
dirhash is destroyed.

Suggested by: pjd
Tested by: pho
Approved by: dwmalone (mentor)
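
Illustration (not part of the commit): why TAILQ_FOREACH_SAFE matters when
entries are destroyed during iteration. The toy_* names are hypothetical.
The macro is defined locally only if <sys/queue.h> does not already
provide it, which is an assumption made for portability to systems whose
queue.h lacks the _SAFE variants.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)          \
    for ((var) = TAILQ_FIRST((head));                       \
        (var) && ((tvar) = TAILQ_NEXT((var), field), 1);    \
        (var) = (tvar))
#endif

struct toy_hash {
    int size;
    TAILQ_ENTRY(toy_hash) link;
};
TAILQ_HEAD(toy_list, toy_hash);

static struct toy_hash *
toy_add(struct toy_list *l, int size)
{
    struct toy_hash *h = malloc(sizeof(*h));

    assert(h != NULL);
    h->size = size;
    TAILQ_INSERT_TAIL(l, h, link);
    return (h);
}

/* Destroy every entry larger than limit; return how many were freed. */
static int
toy_lowmem(struct toy_list *l, int limit)
{
    struct toy_hash *h, *tmp;
    int n = 0;

    /* tmp holds the successor, so freeing h does not break the walk. */
    TAILQ_FOREACH_SAFE(h, l, link, tmp) {
        if (h->size > limit) {
            TAILQ_REMOVE(l, h, link);
            free(h);
            n++;
        }
    }
    return (n);
}

static int
toy_count(struct toy_list *l)
{
    struct toy_hash *h;
    int n = 0;

    TAILQ_FOREACH(h, l, link)
        n++;
    return (n);
}
```

With plain TAILQ_FOREACH, the loop would read the link field of an entry
that has just been freed in order to advance.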


194296 16-Jun-2009 kib

Do not use casts (int *)0 and (struct thread *)0 for the arguments of
vn_rdwr, use NULL.

Reviewed by: jhb
MFC after: 1 week


193511 05-Jun-2009 rwatson

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


193375 03-Jun-2009 snb

Add vm_lowmem event handler for dirhash. This will cause dirhashes to be
deleted when the system is low on memory. This ought to allow an increase to
vfs.ufs.dirhash_maxmem on machines that have lots of memory, without
degrading performance by having too much memory reserved for dirhash when
other things need it. The default value for dirhash_maxmem is being kept at
2MB for now, though.

This work was mostly done during the 2008 Google Summer of Code.

Approved by: dwmalone (mentor), re
MFC after: 3 months


193307 02-Jun-2009 attilio

Handle lock recursion differently by always checking against LO_RECURSABLE
instead of the lock's own flag.

Tested by: pho


192895 27-May-2009 jamie

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


192586 22-May-2009 trasz

Make 'struct acl' larger, as required to support NFSv4 ACLs. Provide
compatibility interfaces in both kernel and libc.

Reviewed by: rwatson


192260 17-May-2009 alc

Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This
eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().

In collaboration with: tegge


191990 11-May-2009 attilio

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and related parts no longer take the
context, since it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

The VFS KPI is heavily changed by this commit, so third-party modules need
to be recompiled. Bump __FreeBSD_version in order to signal this
situation.


191940 09-May-2009 kan

Do not embed struct ucred into larger netcred parent structures.

Credential might need to hang around longer than its parent and be used
outside of mnt_explock scope controlling netcred lifetime. Use a separately
allocated, reference-counted ucred instead.

While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up
some unused declarations in new NFS code.

Reported by: John Hickey
PR: kern/133439
Reviewed by: dfr, kib


191564 27-Apr-2009 rmacklem

Change the semantics of i_modrev/va_filerev to what is required for
the nfsv4 Change attribute. There are 2 changes:
1 - The value now changes on metadata changes as well as data
modifications (incremented for IN_CHANGE instead of IN_UPDATE).
2 - It is now saved in spare space in the on-disk i-node so that it
survives a crash.
Since va_filerev is not passed out into user space, the only current
use of va_filerev is in the nfs server, which uses it as the directory
cookie verifier. Since this verifier is only passed back to the server
by a client verbatim and then the server doesn't check it, changing the
semantics should not break anything currently in FreeBSD.

Reviewed by: bde
Approved by: kib (mentor)


191315 20-Apr-2009 kib

In ufs_checkpath(), recheck that '..' still points to the inode with
the same inode number after VFS_VGET() and relock of the vp. If '..'
changed, redo the lookup. To reduce code duplication, move the code to
read '..' dirent into the static helper function ufs_dir_dd_ino().

Supply the source inode number as an argument to ufs_checkpath() instead
of the source inode itself. The inode is unlocked, thus it might be
reclaimed, causing accesses to the freed memory.

Use vn_vget_ino() to get the '..' vnode by its inode number, instead of
directly code VFS_VGET() and relock, to properly busy the mount point
while vp lock is dropped.

Noted and reviewed by: tegge
Tested by: pho
MFC after: 1 month


191260 19-Apr-2009 kib

When verifying '..' after VFS_VGET() in ufs_lookup(), do not return
error if '..' is still there but changed between lookup and check.
Start relookup instead. Rename is supposed to change '..' reference
atomically, so transient failures introduced by r191137 are wrong.

While rearranging the code to allow lookup restart in ufs_lookup(),
remove the comment that only distracts the reader.

Noted and reviewed by: tegge
Also reported by: pho
MFC after: 1 month
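
Illustration (not part of the commit): the restart-on-change behaviour
described above, as a toy userland sketch. struct toy_dir and both
functions are hypothetical; toy_unlock_relock() merely simulates a
concurrent rename changing '..' while the directory lock is dropped.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dir {
    unsigned dotdot;        /* inode number that '..' currently names */
    int renames_pending;    /* simulated concurrent renames still to occur */
};

/* Simulate dropping and retaking the lock; a rename may slip in between. */
static void
toy_unlock_relock(struct toy_dir *d)
{
    if (d->renames_pending > 0) {
        d->renames_pending--;
        d->dotdot++;
    }
}

/* Return a '..' inode number that was stable across the unlocked window. */
static unsigned
toy_lookup_dotdot(struct toy_dir *d, int *restarts)
{
    unsigned ino;

    assert(d != NULL && restarts != NULL);
    *restarts = 0;
    for (;;) {
        ino = d->dotdot;        /* read '..' while locked */
        toy_unlock_relock(d);   /* fetching the parent drops the lock */
        if (d->dotdot == ino)
            return (ino);       /* unchanged: the result is valid */
        (*restarts)++;          /* changed: look up again, not an error */
    }
}
```

The loop ends as soon as one pass sees the same inode number before and
after the unlocked window, so a transient rename costs a retry rather
than a failed lookup.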


191249 18-Apr-2009 trasz

Use acl_alloc() and acl_free() instead of using uma(9) directly.
This will make switching to malloc(9) easier; also, it would be
necessary to add these routines if/when we implement variable-size
ACLs.


191137 16-Apr-2009 kib

Verify that '..' still exists with the same inode number after
VFS_VGET() has returned in ufs_lookup(). If the '..' lookup started
immediately before the parent directory was removed, we might return
either cleared or unrelated inode otherwise.

Ufs_lookup() is split into new function ufs_lookup_() that either does
lookup, or verifies that directory entry exists and references supplied
inode number.

Reviewed by: tegge
Tested by: pho,
Andreas Tobler <andreast-list fgznet ch> (previous version)
MFC after: 1 month


190888 10-Apr-2009 rwatson

Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon


190690 04-Apr-2009 kib

When removing or renaming a snapshot, do not delve into request_cleanup().
The latter may need blocks from the underlying device that belong
to normal files, that should not be locked while snap lock is held.

Reported and tested by: pho
MFC after: 1 month


190469 27-Mar-2009 kib

Correct typo.

Noted by: kensmith


189878 16-Mar-2009 kib

Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, the ffs background cylinder group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to a buffer shortage, requesting the buffer deadlocks bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.

Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf(). The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.

In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks


189737 12-Mar-2009 kib

The non-modifying EA VOPs are executed with only shared vnode lock taken.
Provide a custom lock around initializing and tearing down EA area,
to prevent both memory leaks and double-free of it. Count the number
of EA area accessors.

Lock protocol requires either holding exclusive vnode lock to modify
i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag.

Noted by: YAMAMOTO, Taku <taku tackymt homeip net>
Tested by: pho (previous version)
MFC after: 2 weeks


189706 11-Mar-2009 kib

Do not double-free the struct inode when insmntque failed. Default
insmntque destructor reclaims the vnode, and ufs_reclaim frees the memory.

Reviewed by: tegge
MFC after: 3 days


189696 11-Mar-2009 jhb

Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month


189595 09-Mar-2009 jhb

Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the following
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require some gross shims
to allow for that.

MFC after: 1 month
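
Illustration (not part of the commit): how a buffer count in the tens of
thousands can push a byte total past what an int holds. The 64 KiB
per-buffer size used in the note below is a made-up value chosen for the
example, not the multiplier the kernel uses, and the toy_* helpers are
hypothetical. The sketch assumes a 32-bit int.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Total space in bytes, computed in 64 bits so this sum cannot overflow. */
static int64_t
toy_bufspace(int64_t nbuf, int64_t bufsize)
{
    assert(nbuf >= 0 && bufsize >= 0);
    return (nbuf * bufsize);
}

/* Would the same total still be representable in a plain int? */
static int
toy_fits_in_int(int64_t nbuf, int64_t bufsize)
{
    return (toy_bufspace(nbuf, bufsize) <= INT_MAX);
}
```

With 44000 buffers of the assumed 65536 bytes each, the total is
2883584000, which exceeds INT_MAX on platforms with a 32-bit int but fits
comfortably in a 64-bit long.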


188956 23-Feb-2009 trasz

Right now, when trying to unmount a device that's already gone,
msdosfs_unmount() and ffs_unmount() exit early after getting ENXIO.
However, dounmount() treats ENXIO as a success and proceeds with
unmounting. In effect, the filesystem gets unmounted without closing
GEOM provider etc.

Reviewed by: kib
Approved by: rwatson (mentor)
Tested by: dho
Sponsored by: FreeBSD Foundation


188954 23-Feb-2009 trasz

Refactor, moving error checking outside of the
'if (mp->mnt_flag & MNT_SOFTDEP)' conditional. No functional
changes.

Reviewed by: kib
Approved by: rwatson (mentor)
Tested by: pho
Sponsored by: FreeBSD Foundation


188501 11-Feb-2009 jhb

- If the g_access() call for the initial root mount fails, then fully
cleanup. Before the GEOM consumer would not have been closed.
- Bump the reference on the character device being mounted while the
associated devfs vnode is locked.

Reviewed by: kib


188240 06-Feb-2009 trasz

When a device containing mounted UFS filesystem disappears, the type
of devvp becomes VBAD, which UFS incorrectly interprets as snapshot
vnode, which in turn causes a panic. Fix it by replacing '!= VCHR'
with '== VREG'.

With this fix in place, you should no longer be able to panic the system
by removing a device with an UFS filesystem mounted from it - assuming
you don't use softupdates.

Reviewed by: kib
Tested by: pho
Approved by: rwatson (mentor)
Sponsored by: FreeBSD Foundation
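
Illustration (not part of the commit): why '!= VCHR' misclassifies a VBAD
vnode while '== VREG' does not. The enum and both predicates are
hypothetical toy versions; only the two comparisons are taken from the
description above.

```c
#include <assert.h>

/* Toy subset of vnode types, named after their kernel counterparts. */
enum toy_vtype { TOY_VNON, TOY_VREG, TOY_VDIR, TOY_VBLK, TOY_VCHR, TOY_VBAD };

/* Old test: anything that is not a character device counts as a snapshot. */
static int
toy_is_snapshot_old(enum toy_vtype t)
{
    assert(t >= TOY_VNON && t <= TOY_VBAD);
    return (t != TOY_VCHR);
}

/* New test: only a regular file can be a snapshot. */
static int
toy_is_snapshot_new(enum toy_vtype t)
{
    assert(t >= TOY_VNON && t <= TOY_VBAD);
    return (t == TOY_VREG);
}
```

Both predicates agree for VCHR and VREG; they differ only for types such
as VBAD, which is exactly the case a vanished device produces.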


187894 29-Jan-2009 trasz

Make sure the cdev doesn't go away while the filesystem is still mounted.
Otherwise dev2udev() could return garbage.

Reviewed by: kib
Approved by: rwatson (mentor)
Sponsored by: FreeBSD Foundation


187790 27-Jan-2009 rwatson

Following a fair amount of real world experience with ACLs and
extended attributes since FreeBSD 5, make the following semantic
changes:

- Don't update the inode modification time (mtime) when extended
attributes (and hence also ACLs) are added, modified, or removed.
- Don't update the inode access time (atime) when extended attributes
(and hence also ACLs) are queried.

This means that rsync (and related tools) won't improperly think
that the data in the file has changed when only the ACL has changed.

Note that ffs_reallocblks() has not been changed to not update on an
IO_EXT transaction, but currently EAs don't use the cluster write
routines so this shouldn't be a problem. If EAs grow support for
clustering, then VOP_REALLOCBLKS() will need to grow a flag argument
to carry down IO_EXT to UFS.

MFC after: 1 week
PR: ports/125739
Reported by: Alexander Zagrebin <alexz@visp.ru>
Tested by: pluknet <pluknet@gmail.com>,
Greg Byshenk <freebsd@byshenk.net>
Discussed with: kib, kientzle, timur, Alexander Bokovoy <ab@samba.org>


187564 21-Jan-2009 jhb

Fix a few style bogons.

Submitted by: bde


187528 21-Jan-2009 kib

Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by: jhb
Tested by: pho
MFC after: 1 month


187526 21-Jan-2009 jhb

Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.

Inspired by: ups
Tested by: pho


187490 20-Jan-2009 kib

The r187467 should remove all pages for V_NORMAL case too, because
indirect block pages are not removed by the mentioned invocation of
the vnode_pager_setsize().

Put a common code into the helper function ffs_pages_remove().

Reported and tested by: dchagin
Reviewed by: ups
MFC after: 3 weeks


187474 20-Jan-2009 jhb

Add a comment explaining why the "bufwait" / "dirhash" LOR reported by
WITNESS will not actually result in a deadlock.

Discussed with: kib
MFC after: 1 week


187468 20-Jan-2009 kib

When extending inode size, we call vnode_pager_setsize(), to have an
address space in which to put vnode pages, and then call UFS_BALLOC(),
to actually allocate a new block and map it. When UFS_BALLOC() returns
an error, sometimes we forget to revert the vm object size increase,
allowing for pages that are not backed by logical disk blocks.

Revert vnode_pager_setsize() back when UFS_BALLOC() failed, for
ffs_truncate() and ffs_write().

PR: 129956
Reviewed by: ups
MFC after: 3 weeks
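
The grow-then-revert pattern described in r187468 can be sketched in
user space as follows. This is a minimal illustration, not the kernel
code: struct fake_object, fake_balloc() and fake_extend() are
hypothetical stand-ins for the vm object, UFS_BALLOC() and the callers
in ffs_truncate() / ffs_write().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the VM object backing a vnode. */
struct fake_object {
	size_t size;		/* bytes of address space the object covers */
};

/* Hypothetical allocator: fails with ENOSPC when no blocks are left. */
static int
fake_balloc(size_t *blocks_free)
{
	if (*blocks_free == 0)
		return (ENOSPC);
	(*blocks_free)--;
	return (0);
}

/*
 * Extend the object first, then try to back the new range with a block.
 * On failure, restore the old size so that no page is left without a
 * backing block.
 */
static int
fake_extend(struct fake_object *obj, size_t newsize, size_t *blocks_free)
{
	size_t oldsize = obj->size;
	int error;

	assert(newsize >= oldsize);
	obj->size = newsize;			/* plays the role of vnode_pager_setsize() */
	error = fake_balloc(blocks_free);	/* plays the role of UFS_BALLOC() */
	if (error != 0)
		obj->size = oldsize;		/* revert the size increase */
	return (error);
}
```

With one free block, a first fake_extend() succeeds and keeps the new
size; a second one fails with ENOSPC and leaves the size unchanged.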


187467 20-Jan-2009 kib

FFS puts the extended attributes blocks at the negative blocks for the
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.

Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitly
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.

Reported by: csjp
Reviewed by: ups
MFC after: 3 weeks


186898 08-Jan-2009 kib

Lock the uepm_lock around the autostart of extattrs.

Reported and tested by: pho
Reviewed by: rwatson
MFC after: 3 weeks


186897 08-Jan-2009 kib

If unmount of the ffs mp failed, reinitialize the extended attributes
for the mp, and restart them if autostart is enabled.

Reported and tested by: pho
Reviewed by: rwatson
MFC after: 3 weeks


186278 18-Dec-2008 kib

Do not busy twice the mount point where a quota operation is performed.

Tested by: pho
MFC after: 1 month


186194 16-Dec-2008 trasz

According to phk@, VOP_STRATEGY should never, _ever_, return
anything other than 0. Make it so. This fixes
"panic: VOP_STRATEGY failed bp=0xc320dd90 vp=0xc3b9f648",
encountered when writing to an orphaned filesystem. Reason
for the panic was the following assert:
KASSERT(i == 0, ("VOP_STRATEGY failed bp=%p vp=%p", bp, bp->b_vp));
at vfs_bio:bufstrategy().

Reviewed by: scottl, phk
Approved by: rwatson (mentor)
Sponsored by: FreeBSD Foundation


185761 08-Dec-2008 kib

The dqrele() function syncs the dq, then acquires the dqh lock, and then
does the final drop of the dq reference to put it onto the free list.
There is a possibility that the dq would be found by another thread
after sync and before the dqh lock is acquired. If that other thread
drops the dq before we have taken the dqh lock, the dirty dq is put on
the free list.

Recheck the DQ_MOD after the dqh lock is relocked. Repeat dqsync() if
the dq is dirty. This ensures that up to date dq is written in the quota
file and fixes assertion in dqget().

Reported and tested by: Frode Nordahl <frode nordahl net>
MFC after: 3 days
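
The recheck loop from r185761 can be sketched as below. This is a
single-threaded simulation under stated assumptions: struct fake_dquot
and its functions are hypothetical, the dqh lock is only marked in
comments, and the racer_pending field is a test knob that imitates
another thread dirtying the dq between the sync and the lock
acquisition.

```c
#include <assert.h>

/* Hypothetical, simplified dquot. */
struct fake_dquot {
	int dirty;		/* stands in for the DQ_MOD flag */
	int syncs;		/* how many times the dq was written out */
	int on_freelist;	/* set once the dq is put on the free list */
	int racer_pending;	/* simulation knob: another thread will dirty the dq once */
};

/* Write the dq out if it is dirty. */
static void
fake_dqsync(struct fake_dquot *dq)
{
	if (dq->dirty) {
		dq->dirty = 0;
		dq->syncs++;
	}
}

/* Release the last reference, never putting a dirty dq on the free list. */
static void
fake_dqrele(struct fake_dquot *dq)
{
	for (;;) {
		fake_dqsync(dq);	/* done without the hash lock held */
		if (dq->racer_pending) {
			/* Another thread finds the dq, dirties it, and drops it. */
			dq->racer_pending = 0;
			dq->dirty = 1;
		}
		/* The hash lock would be acquired here. */
		if (!dq->dirty)
			break;		/* still clean under the lock: safe to free */
		/* The hash lock would be dropped here before syncing again. */
	}
	assert(!dq->dirty);
	dq->on_freelist = 1;
	/* The hash lock would be released here. */
}
```

In this simulation a dq that is dirtied once during the window is
synced twice and reaches the free list clean.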


185739 07-Dec-2008 kib

Improve usefulness of the panic by printing the pointer to the problematic
dquot. In-tree gdb is often unable to get the dq value, so supply it in
panic message.

MFC after: 3 days


185556 02-Dec-2008 kib

Do not lock vnode interlock around reading of v_iflag to check VI_DOOMED.
Read of the pointer is atomic, and flag cannot be set while vnode lock
is held.

Requested by: jhb
MFC after: 1 month


185170 22-Nov-2008 kib

Busy ufs filesystem around block of code that does ".." lookup. Since
mnt_lock is before lock of any vnode on the mp, it uses LK_NOWAIT. Since
MNTK_UNMOUNT may be transient, pdp lock is dropped when vfs_busy()
failed, and operation is retried after some time. This way, ffs_vget()
is not called on the mp that may be in the process of being destroyed by
unmount.

Check for the VI_DOOMED flag on pdp after its lock is reacquired, to
better detect some situations where directory containing ".."
entry is removed during the lookup.

Reviewed by: tegge, attilio (previous version)
Tested by: pho
MFC after: 1 month


185102 19-Nov-2008 jhb

Fix typo.


184934 13-Nov-2008 ambrisko

For now, flush the buffer cache every 10 cylinder groups to free
up space. If the buffer cache fills up then the disk systems can
grind to a halt. Better tuning can be figured out later.

Tested by: Tim, others and work
Reviewed by: Kostik Belousov
PR: 128832


184651 04-Nov-2008 jhb

Quiet a WITNESS warning with the dirhash sx locks by setting the DUPOK
flag. Specifically, if two threads race to create a dirhash for a
directory, then one might already have created a private dirhash
structure (and locked it) when it realizes the directory now has a
structure and tries to lock that one.


184629 04-Nov-2008 trasz

In UFS, when reading EA that contains ACL fails for some reason, include
inode number and filesystem name, so the administrator can fix the problem.

Approved by: rwatson (mentor)


184554 02-Nov-2008 attilio

Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no longer linked to a lockmgr.
Change its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr already) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used in vfs_mount_destroy(), which allows overriding the
mnt_ref if running for more than 3 seconds, makes the refcount totally
useless. Remove it, as it was meant to work in older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix it properly instead of trusting such a hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if the waiters were actually
woken up. Only one place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho


184413 28-Oct-2008 trasz

Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which are 16 bit.

Approved by: rwatson (mentor)
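
The truncation problem that motivates r184413 can be illustrated with
a small sketch. The type names and bit values below are hypothetical
(fake_mode_t is simply a 16-bit type, as the commit says of mode_t, and
FAKE_VNEWBIT is an invented flag above bit 15); they are not the
kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t fake_mode_t;		/* 16 bits wide */
typedef uint32_t fake_accmode_t;	/* hypothetical wider access-mode type */

#define FAKE_VREAD	0x0100u		/* hypothetical flag that fits in 16 bits */
#define FAKE_VNEWBIT	0x10000u	/* hypothetical flag in bit 16 */

/*
 * Pass an access request through a 16-bit variable, as code that stores
 * V* flags in a mode_t would do.  Any flag above bit 15 is lost.
 */
static fake_accmode_t
fake_through_mode(fake_accmode_t request)
{
	fake_mode_t narrow;

	assert(sizeof(narrow) == 2);
	narrow = (fake_mode_t)request;	/* silently drops the high bits */
	return (narrow);
}
```

Here fake_through_mode(FAKE_VREAD | FAKE_VNEWBIT) returns only
FAKE_VREAD, which is why a dedicated wider type is needed before more
flags can be added.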


184408 28-Oct-2008 kib

Provide an explanation for getinoquota() call in the ufs_access vop.

MFC after: 3 days


184214 23-Oct-2008 des

Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.


184205 23-Oct-2008 des

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


184074 20-Oct-2008 kib

Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock
and ffs_lock. This cannot catch situations where holdcnt is incremented
not by curthread, but I think it is useful.

Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks


183822 13-Oct-2008 kib

Sync up summary information for cylinder groups while data is already
in memory during snapshot creation. This improves the results of the
background fsck.

Submitted by: tegge
MFC after: 1 week


183754 10-Oct-2008 attilio

Remove the useless struct thread argument from the bufobj interface.
In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


183331 24-Sep-2008 jhb

Enable shared lookups on UFS. There are some remaining issues with forced
unmounts, but those are in the VFS lookup code and are not UFS specific.

Tested by: pho, kris


183280 22-Sep-2008 jhb

Close a race between concurrent calls to ufsdirhash_recycle() and
ufsdirhash_free() introduced in my last commit by removing the dirhash
about to be free'd in ufsdirhash_free() from the global dirhash list
before dropping the sx lock.

Tested by: kris


183212 20-Sep-2008 kib

Initialize va_flags and va_filerev properly in VOP_GETATTR(). Don't
initialize va_vaflags and va_spare because they are not part of the
VOP_GETATTR() API. Also don't initialize birthtime to ctime or zero.

Submitted by: Jaakko Heinonen <jh saunalahti fi>
Reviewed by: bde
Discussed on: freebsd-fs
MFC after: 1 month


183093 16-Sep-2008 jhb

Retire the 'i_reclen' field from the in-memory i-node. Previously,
during a DELETE lookup operation, lookup would cache the length of the
directory entry to be deleted in 'i_reclen'. Later, the actual VOP to
remove the directory entry (ufs_remove, ufs_rename, etc.) would call
ufs_dirremove() which extended the length of the previous directory
entry to "remove" the deleted entry.

However, we always read the entire block containing the directory
entry when doing the removal, so we always have the directory entry to
be deleted in-memory when doing the update to the directory block.
Also, we already have to figure out where the directory entry that is
being removed is in the block so that we can pass the component name
to the dirhash code to update the dirhash. So, instead of passing
'i_reclen' from ufs_lookup() to the ufs_dirremove() routine, just read
the 'd_reclen' field directly out of the entry being removed when
updating the length of the previous entry in the block.

This avoids a cosmetic issue of writing to 'i_reclen' while holding a
shared vnode lock. It also slightly reduces the amount of side-band
data passed from ufs_lookup() to operations updating a directory via
the directory's i-node.

Reviewed by: jeff
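
The removal scheme described in r183093 can be sketched as follows: the
previous entry's record length is extended by the length read directly
from the entry being removed, so no cached 'i_reclen' is needed. This
is a simplified illustration; struct fake_direct and the helper names
are hypothetical and use fixed 16-byte records rather than the real
on-disk directory format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified directory entry header (16 bytes). */
struct fake_direct {
	uint32_t d_ino;		/* inode number */
	uint16_t d_reclen;	/* distance to the next entry in the block */
	char	 d_name[10];	/* name, unused in this sketch */
};

/* Store an entry header at byte offset 'off' in the directory block. */
static void
fake_put(unsigned char *block, size_t off, uint32_t ino, uint16_t reclen)
{
	struct fake_direct d;

	memset(&d, 0, sizeof(d));
	d.d_ino = ino;
	d.d_reclen = reclen;
	memcpy(block + off, &d, sizeof(d));
}

/* Read the record length of the entry at byte offset 'off'. */
static uint16_t
fake_reclen(const unsigned char *block, size_t off)
{
	struct fake_direct d;

	memcpy(&d, block + off, sizeof(d));
	return (d.d_reclen);
}

/*
 * "Remove" the entry at 'off' by growing the previous entry over it.
 * The length comes from the victim's own d_reclen in the block.
 */
static void
fake_dirremove(unsigned char *block, size_t prev_off, size_t off)
{
	struct fake_direct prev;

	memcpy(&prev, block + prev_off, sizeof(prev));
	assert(prev_off + prev.d_reclen == off);
	prev.d_reclen += fake_reclen(block, off);
	memcpy(block + prev_off, &prev, sizeof(prev));
}
```

After removing the middle of three 16-byte entries, a walk over the
block goes from offset 0 straight to offset 32.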


183080 16-Sep-2008 jhb

Fix a race with shared lookups on UFS. If the dirhash code reached the
cap on memory usage, then shared LOOKUP operations could start free'ing
dirhash structures. Without these fixes, concurrent free's on the same
directory could result in one of the threads blocked on a lock in a dirhash
structure free'd by the other thread.
- Replace the lockmgr lock in the dirhash structure with an sx lock.
- Use a reference count managed with ufsdirhash_hold()/drop() to determine
when to free the dirhash structures. The directory i-node holds a
reference while the dirhash is attached to an i-node. Code that wishes
to lock the dirhash while holding a shared vnode lock must first
acquire a private reference to the dirhash while holding the vnode
interlock before acquiring the dirhash sx lock. After acquiring the sx
lock, it drops the private reference after checking to see if the
dirhash is still used by the directory i-node.
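
The hold/drop reference counting described in r183080 can be sketched
in plain C as below. The names are hypothetical and the counter is not
atomic; this sketch assumes the caller serializes access, which in the
commit is the job of the vnode interlock and the dirhash sx lock.

```c
#include <assert.h>

/* Hypothetical, simplified dirhash with a reference count. */
struct fake_dirhash {
	int refcnt;	/* references held by the i-node and by lookups */
	int freed;	/* set when the last reference goes away */
};

/* Take a reference; the caller is assumed to hold the protecting lock. */
static void
fake_dirhash_hold(struct fake_dirhash *dh)
{
	assert(!dh->freed);
	dh->refcnt++;
}

/* Drop a reference and free the structure when it was the last one. */
static int
fake_dirhash_drop(struct fake_dirhash *dh)
{
	assert(dh->refcnt > 0);
	if (--dh->refcnt == 0) {
		dh->freed = 1;	/* a real implementation would free memory here */
		return (1);
	}
	return (0);
}
```

If the i-node holds one reference and a lookup takes a private one, the
i-node detaching the dirhash does not free it; the structure is freed
only when the lookup drops its reference too.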


183079 16-Sep-2008 jhb

- Only set i_offset in the parent directory's i-node during a lookup for
non-LOOKUP operations.
- Relax a VOP assertion for a DELETE lookup. rename() uses WANTPARENT
instead of LOCKPARENT when looking up the source pathname. ufs_rename()
uses a relookup() to lock the parent directory when it decides to finally
remove the source path. Thus, it is ok for a DELETE with WANTPARENT set
instead of LOCKPARENT to use a shared vnode lock rather than an exclusive
vnode lock.

Reported by: kris (2)
Reviewed by: jeff


183078 16-Sep-2008 jhb

vdropl() drops the vnode interlock. Thus, the code in the QUOTA case that
upgrades the vnode lock if it is share locked was dropping the interlock
before actually checking VI_DOOMED. Fix this by doing the vdropl() after the
check and relying on it to drop the vnode interlock.

Reported by: pho
Reviewed by: kib
MFC after: 1 week


183074 16-Sep-2008 kib

Suspend the write operations on the UFS filesystem being unmounted or
remounted from rw to ro.

Proposed and reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


183073 16-Sep-2008 kib

When an attempt is made to suspend a filesystem that is already suspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write_resume() when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows bypassing the suspension
point in vn_start_write(). It is intended for use by VFS in
situations where the suspender wants to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


183072 16-Sep-2008 kib

Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a command to dump the dirty/clean buffer list for vnode.

Reviewed by: tegge
Tested and used by: pho
MFC after: 1 month


183070 16-Sep-2008 kib

When downgrading the read-write mount to read-only, do_unmount() sets
MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive()
and ufs_itimes_locked(), UFS verifies whether the fs is read-only by
checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag
for inode on the fs being remounted rw->ro.

Introduce the UFS_RDONLY() struct ufsmount method that reports the value
of fs_ronly. The latter is set to 1 only after the remount is
finished.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


183067 16-Sep-2008 kib

The struct inode *ip supplied to softdep_freefile is not necessarily the
inode having number ino. In r170991, the ip was marked IN_MODIFIED, which
is not quite correct.

Mark only the right inode modified by checking inode number.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


182721 03-Sep-2008 trasz

When calling extattr_check_cred, use V{READ,WRITE}, not I{READ,WRITE}.

Approved by: rwatson (mentor)


182542 31-Aug-2008 attilio

Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>


182371 28-Aug-2008 attilio

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


182366 28-Aug-2008 kib

In ffs_valloc(), ffs_vget() may fail because insmntque() refused to
insert new vnode into the mount vnode list. Then, for the SU-enabled
mount, ffs_vfree could create freefile dependency. This dependency can
hang around forever since inode is not marked as IN_MODIFIED and
correspondingly inodeblock may be not marked as dirty.

After ffs_vget() fails, retry with FFSV_FORCEINSMQ, mark the inode as
modified, and vput() it immediately. Take care of the dup alloc.

Tested by: pho
Reviewed by: tegge
MFC after: 1 month


182365 28-Aug-2008 kib

Softdep code may need to instantiate vnode when processing
dependencies. In particular, it may need this while syncing filesystem
being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set,
that could sometimes disallow insertion of the vnode into the vnode
mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag.

Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for
new vnode and use it consistently from the softdep code instead of
ffs_vget().

Add the retry logic to the softdep_flushfiles() to flush the vnodes
that could be instantiated while flushing softdep dependencies.

Tested by: pho, kris
Reviewed by: tegge
MFC after: 1 month


182115 24-Aug-2008 kib

Put the relocked variable from the r182111 into the #ifdef QUOTA braces
to prevent warning about unused var on the !QUOTA kernels.

Reported by: ed
MFC after: 1 week


182111 24-Aug-2008 kib

Revert the r167541: "Remove unneeded getinoquota() call in the
ufs_access()." The call to getinoquota in ufs_access() serves the
purpose of instantiating inode dquot from the vn_open(). Since quotas
are accounted only for the inodes with already attached dquot, removal
of the call prevented opened inodes from participation in the quota
calculations.

Since ufs_access() may be called with the vnode being only shared
locked, upgrade (and then downgrade) vnode lock if calling
getinoquota().

Reported by: simon at optinet com
In collaboration with: pho
MFC after: 1 week


181528 10-Aug-2008 kib

Revert r181345.
Move the NULL pointer check to the vfs_deleteopt() function.

Discussed with: rodrigc
MFC after: 3 days


181345 06-Aug-2008 kib

User may do "mount -o snapshot ...", that causes new FFS mount to be
performed with snapshot option, while the mp->mnt_opt is NULL.
Protect against NULL pointer dereference.

Noted by: Mateusz Guzik <mjguzik gmail com>
MFC after: 3 days


181329 05-Aug-2008 des

ufsmount.h uses "struct\tfoo *bar;", except where it doesn't.
quota.h uses "struct foo\t*bar;", except where it doesn't.
Try to make them both agree with themselves (though not with each other)


181327 05-Aug-2008 des

Whitespace, prototypes


181018 30-Jul-2008 jhb

Whitespace tweak.


180758 23-Jul-2008 kib

The ffs_balloc_ufs{1,2} functions call bdwrite() while having several
vnode buffers locked at once. In particular, there are indirect buffers
among locked ones. The bdwrite() may start the flushing to keep the dirty
buffer list within bounds. If any buffer on the dirty list requires
translation from logical to physical block number, code may end up
trying to lock an indirect buffer already locked in ffs_balloc_ufsX.

Prevent the bdflush() activity when several buffers are locked at once
by setting the TDP_INBDFLUSH for the problematic code blocks.

Reported and tested by: pho, Josef Buchsteiner at Juniper
In collaboration with: kan
MFC after: 1 month


180621 19-Jul-2008 pjd

Say hi to svn, by simplifying ffs_vget() function a bit - there is no need for
a variable that is used only once.


179295 24-May-2008 rodrigc

Fix comments to replace SBSIZE with SBLOCKSIZE, since SBSIZE
was renamed to SBLOCKSIZE in version 1.33

Reviewed by: mckusick


179270 24-May-2008 rodrigc

After converting the "snapshot" mount option to the MNT_SNAPSHOT flag,
delete "snapshot" from the persistent mount options list.
This should fix problems with doing a mount -o snapshot of a file system, followed by
an NFS export of the same file system.

PR: 122833
Reported by: Leon Kos <leon.kos lecad fs uni-lj si>,
Jaakko Heinonen <jh saunalahti fi>
MFC after: 1 month


179269 24-May-2008 rodrigc

For the following mount options, do not perform the string to flag conversions
here, because we already do them further up in vfs_donmount() in vfs_mount.c

async -> MNT_ASYNC
force -> MNT_FORCE
multilabel -> MNT_MULTILABEL
noatime -> MNT_NOATIME
noclusterr -> MNT_NOCLUSTERR
noclusterw -> MNT_NOCLUSTERW

MFC after: 1 month


179159 20-May-2008 ups

Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set)
Directory IO without a VM object will store data in 'malloced' buffers
severely limiting caching of the data. Without this change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo


178420 22-Apr-2008 jeff

- Use a local variable for i_ino in ufs_lookup. It is only used to
communicate between two parts of this one function. This was causing
problems with shared lookups as each would trash the ino value in the
inode.
- Remove the unused i_ino field from the inode structure.


178243 16-Apr-2008 kib

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


178110 11-Apr-2008 jeff

- Use a lockmgr lock rather than a mtx to protect dirhash. This lock
may be held for the duration of the various dirhash operations which
avoids many complex unlock/lock/revalidate sequences.
- Permit shared locks on lookup. To protect the ip->i_dirhash pointer we
use the vnode interlock in the shared case. Callers holding the
exclusive vnode lock can run without fear of concurrent modification to
i_dirhash.
- Hold an exclusive dirhash lock when creating the dirhash structure for
the first time or when re-creating a dirhash structure which has been
recycled.

Tested by: kris, pho


178109 11-Apr-2008 jeff

- cache dp->i_offset in the local 'i_offset' variable for use in loop
indexes so directory lookup becomes shared lock safe. In the modifying
cases an exclusive lock is held here so the commit routine may
rely on the state of i_offset.
- Similarly handle i_diroff by fetching at the start and setting only once
the operation is complete. Without the exclusive lock these are only
considered hints.
- Assert that an exclusive lock is held when we're preparing for a commit
routine.
- Honor the lock type request from lookup instead of always using exclusive
locking.

Tested by: pho, kris


177983 07-Apr-2008 pjd

Correct function name in panic().

Reported by: kensmith


177957 06-Apr-2008 attilio

Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
already do and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the correspective per-thread
shared lockmgrs counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw(). These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock. In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS atm, but it will probably be
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wake up, are in the runqueue they can contend the lock with
the exclusive drainer. This is hard to fix, but the now committed
code mitigates this issue a lot better than the (past) CVS version.
In addition, the KA_HELD and KA_UNHELD assertions have been made mute
because they are dangerous and will no longer be supported
soon.

In order to avoid namespace pollution, stack.h is split into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI. In this way, newly added _lockmgr.h can
just include _stack.h.

The kernel ABI is heavily changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
the KPI is broken by the introduction of lockmgr_rw() / lockmgr_args_rw(),
so manpages and __FreeBSD_version will be updated accordingly.

Tested by: kris, pho, jeff, danger
Reviewed by: jeff
Sponsored by: Google, Summer of Code program 2007


177785 31-Mar-2008 kib

Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
sponsored by Google Summer of Code 2007
Reviewed by: rwatson, rdivacky
Tested by: pho


177779 31-Mar-2008 jeff

- Since rev 1.142 of ffs_snapshot.c the interlock has not been required
to protect the v_lock pointer. Removing the interlock acquisition
here allows vn_lock() to proceed without requiring the interlock
at all.
- If the lock mutated while we were sleeping on it the interlock has
been dropped. It is conceivable that the upper layer code was
relying on the interlock and LK_NOWAIT to protect the identity or
state of the vnode while acquiring the lock. In this case return
EBUSY rather than trying the new lock to prevent potential races.

Reviewed by: tegge


177778 31-Mar-2008 jeff

- Don't free snapdata structures when they are no longer in use.
Keeping the lockmgr lock valid allows us to switch the v_lock pointer
in snapshot vnodes between the embedded lockmgr lock and snapdata
lock without needing the vnode interlock to protect against races
- Keep unused snapdata structures in a list.
- Add a function to lock the devvp and allocate a snapdata to it or
acquire a new one without races. The old function was safe from
creation races because we set the mount flag when creating snapshots
and thus serializing them. However, it might have been subject to
destroying races.

Reviewed by: tegge


177645 26-Mar-2008 jhb

Fix a nit with the 'nofoo' options where 'foo' is mapped to 'nonofoo'
(such as 'atime' vs 'noatime'). The filesystems will always see either
'nofoo' or 'nonofoo', never plain 'foo'. As such, their list of valid
mount options should include 'nofoo' instead of 'foo'. With this fix,
you can do 'mount -u -o atime' on a FFS filesystem that isn't marked as
noatime without getting an error. You can also update a noatime FFS
filesystem mounted via mount(2) (e.g. 6.x /sbin/mount binary) to 'atime'
using nmount(2) (e.g. 7.x /sbin/mount binary).

MFC after: 1 week
Reviewed by: crodig


177633 26-Mar-2008 dfr

Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks


177528 23-Mar-2008 kib

Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed: on stable@, with bde
Reviewed by: tegge, jeff [1]
MFC after: 2 weeks


177493 22-Mar-2008 jeff

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


177474 21-Mar-2008 kib

Reduce the acquisition of the vnode interlock in the ffs_read() and
ffs_extread() when setting the IN_ACCESS flag by checking whether the
IN_ACCESS is already set. The possible race there is admissible.

Tested by: pho
Submitted by: jeff


177368 19-Mar-2008 jeff

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


177156 13-Mar-2008 cokane

Replace the non-MPSAFE timeout(9) API in ffs_softdep.c with the MPSAFE
callout_* API (e.g. callout_init_mtx(9)). This was one of the numerous
items on the http://wiki.freebsd.org/SMPTODO list.

Reviewed by: imp, obrien, jhb
MFC after: 1 week


177034 10-Mar-2008 emaste

Remove include of opt_quota.h; as of revision 1.205 there is no longer
any #ifdef QUOTA conditional code.


176831 05-Mar-2008 kib

Initialize mnt_stat.f_iosize before autostarting UFS1 extattrs.
It is normally initialized by ffs_statfs() after ffs_mount finished.

The extattr autostart code calls the ufs_lookup(), that uses value above
to iterate over the directory blocks, see bmask initialization in the
ufs_lookup() and ufsdirhash. Having the filesystem with root directory
spanning more than one block would result in reading a random kernel
memory.

PR: kern/120781
Test case provided by: rwatson
MFC after: 1 week


176797 04-Mar-2008 rwatson

Continue on-going campaign to replace lockmgr locks with sx locks where
the specific semantics of lockmgr aren't required: update UFS1 extended
attributes to protect its data structures using an sx lock.

While here, update comments on lock granularity.

MFC after: 2 weeks


176795 04-Mar-2008 rwatson

Move setting of MNTK_MPSAFE flag before UFS1 extended attribute
auto-start so that the flag is set before we start performing I/O
in the auto-start routine.

MFC after: 2 weeks
Suggested by: kib


176752 02-Mar-2008 rwatson

Don't auto-start or allow extattrctl for UFS2 file systems, as UFS2 has
native extended attributes. This didn't interfere with the operation of
UFS2 extended attributes, but the code shouldn't be running for UFS2.

MFC after: 2 weeks


176564 25-Feb-2008 keramida

Minor typo nit.


176559 25-Feb-2008 attilio

Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>


176519 24-Feb-2008 attilio

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
of spreading bogus stubs all around:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


176320 15-Feb-2008 attilio

- Introduce lockmgr_args() in the lockmgr space. This function performs
the same operation of lockmgr() but accepting a custom wmesg, prio and
timo for the particular lock instance, overriding default values
lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo.
- Use lockmgr_args() in order to implement BUF_TIMELOCK()
- Cleanup BUF_LOCK()
- Remove LK_INTERNAL as it is no longer used in the lockmgr namespace

Tested by: Andrea Barberio <insomniac at slackware dot it>


175635 24-Jan-2008 attilio

Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is really bogus and not currently used.
Hopefully this will soon be replaced by something suitable.
- Remove the prototype for dumplockinfo() as the function is no longer
present

Additionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

The KPI is heavily broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo


175486 19-Jan-2008 attilio

- Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe BUF_REFCNT(), leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)


175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with 'thread' argument passing which is always curthread.
Remove the unnecessary extra argument and pass curthread explicitly to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


175202 10-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This change makes the code cleaner and in
particular removes an annoying dependency, helping the next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, it is worth mentioning that the next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


175068 03-Jan-2008 kib

ffs_balloc_ufsX() routines, in the case of recovering from the failed
allocation, free the indirect blocks before clearing the disk pointers,
that could lead to the softupdate inconsistencies in the case of the
machine or disk crash at the wrong time.

Rearrange the recover code to do the ffs_blkfree() after the second
ffs_syncvnode(), that clears the pointers chain.

Proposed and reviewed by: tegge
Tested by: Peter Holm
MFC after: 3 weeks


175053 02-Jan-2008 obrien

style(9)


174973 29-Dec-2007 kib

The ffs_balloc() routines, when allocating the indirect blocks for
the inode, do the rollback in case the allocation failed (due to
insufficient free space or quota limits). But the code leaves the
buffers corresponding to the indirect blocks on the vnode bufobj list.
This causes several assertions (for instance, "ffs_truncate3"
in ffs_truncate()) to fail, and could result in the indirect block
aliasing problem, like writing the content of such blocks to a random
disk location.

Remove the buffers from the bufobj properly.

Reported and tested by: Peter Holm
Reviewed by: tegge
MFC after: 3 weeks


174126 01-Dec-2007 kensmith

Fix a broken check that recently became more annoying because it now
gets enabled when INVARIANTS is on instead of DIAGNOSTIC (which apparently
nobody uses). From Tor's description:

This happens when the block range spans two block maps, the first in the
inode (mapping up to NDADDR direct blocks) and the second being the first
indirect block. The current check assumes that both block maps are
indirect blocks.

Work done by: tegge
Tested by: kris, kensmith


173501 09-Nov-2007 ru

Fix build without INVARIANTS and update a comment to match
a change made in previous revision.


173464 08-Nov-2007 obrien

Turn most ffs 'DIAGNOSTIC's into INVARIANTS.


172930 24-Oct-2007 rwatson

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


172697 16-Oct-2007 alfred

Get rid of qaddr_t.

Requested by: bde


172113 10-Sep-2007 bz

Fix a DIV0 in case a large value for fs_avgfilesize or fs_avgfpdir
is given (with newfs or tunefs) and dirsize overflows.

In case dirsize is <= 0 because of an overflow set maxcontigdirs
to 0 so it will be 1 later. This is what would happen for large
fs_avgfilesize. [1]

Identified with help from: roberto, pjd
Submitted by: pjd [1]
Approved by: re (rwatson)
MFC after: 8 days
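
The guard described in r172113 above can be sketched as follows. This is an illustration under assumptions, not the ffs_dirpref() code: the function name, its parameters and the cap of 255 are hypothetical, and dirsize is passed in already computed so that an overflowed (non-positive) value can be shown directly.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative upper bound, assumed for this sketch. */
#define SK_MAXCONTIG 255

/*
 * Return how many directories may be placed contiguously.
 * dirsize is the expected size of a directory in bytes; a value <= 0
 * means the multiplication that produced it overflowed.
 */
int
sk_maxcontigdirs(int32_t dirsize, int64_t free_bytes)
{
	int64_t n;

	assert(free_bytes >= 0);
	if (dirsize <= 0)
		n = 0;			/* overflowed: must not divide by it */
	else
		n = free_bytes / dirsize;
	if (n > SK_MAXCONTIG)
		n = SK_MAXCONTIG;
	if (n == 0)
		n = 1;			/* always allow at least one directory */
	return ((int)n);
}
```

With the guard in place, a zero or negative dirsize yields 1 instead of a division by zero.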


171437 13-Jul-2007 rodrigc

Perform range check before allocating memory when reading
extended attributes.

Reviewed by: kib
Approved by: re (hrs)
PR: 114389


171147 02-Jul-2007 peter

Fix an annoying pointer/int cast warning that shows up on 64 bit systems.

Approved by: re


170991 22-Jun-2007 kib

Fix livelock that could occur when snapshotting UFS with quotas, where
some quota limit was exceeded. Sequence of UFS_VALLOC()/UFS_VFREE()
call there could cause inodeblock to have both freefile and inodedep
dependencies without any inode in the block being marked for write.
Then, softdep_check_suspend() would return EAGAIN forever.

Force write of inodeblock with allocated freefile softdependency by
setting IN_MODIFIED flag in softdep_freefile and unconditionally calling
UFS_UPDATE() in ufs_reclaim.

Reported by: kris
Debug help and tested by: Peter Holm
Approved by: re (kensmith)
MFC after: 3 weeks


170587 12-Jun-2007 rwatson

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


170307 05-Jun-2007 jeff

Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
synchronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


170183 01-Jun-2007 kib

Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file:
part 2. Convert calls missed in the first big commit.

Noted by: rwatson
Pointy hat to: kib


170174 01-Jun-2007 jeff

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. Reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to its exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


170152 31-May-2007 kib

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


170041 28-May-2007 pjd

- Remove unnecessary vnode internal locking - v_vflag is protected by vnode's
lock (not vnode's interlock).
- Simplify code a bit.


169898 23-May-2007 pjd

Eliminate VI_LOCK()/VI_UNLOCK() pair from getattr and close code paths.
It's hard to measure performance improvement on my test machine, but the
change won't degrade performance for sure. I can measure slight improvement
for debugging kernel and it can also be a win for machines where atomic
operation is more expensive.

Reviewed by: kib


169671 18-May-2007 kib

Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no longer generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.


169239 03-May-2007 thompsa

Add a newline to the printf message.


168576 10-Apr-2007 kib

Fix the NAMEI zone leak when snapshot was successfully created.

Reported and tested by: Peter Holm
MFC after: 2 weeks


168575 10-Apr-2007 kib

Recalculate the NEWBLOCK flag for pagedep structure after the softdep
lock is dropped, since pagedep may be already processed and deallocated.

Found and tested by: kris
MFC after: 2 weeks


168574 10-Apr-2007 kib

When LK_NOWAIT is passed as argument to process_worklist_item(), this
does not prevent handle_workitem_remove() from recursing into a blocking
version. Add the dirrem to worklist instead of processing it now if this
is the case.

Reported and tested by: kris
Submitted by: tegge
MFC after: 2 weeks


168353 04-Apr-2007 delphij

Use *_EMPTY macros when appropriate.


168021 29-Mar-2007 kib

Revert rev. 1.205. Replace unconditional acquisition of Giant when QUOTAS are
defined with VFS_LOCK_GIANT(NULL) call.
This shall fix softdep operation when mpsafe_vfs = 0.

Reported and tested by: kris
Submitted by: tegge
MFC after: 1 week


167737 20-Mar-2007 kib

Mark UFS as being MP-Safe in "options QUOTA" case too. Remove no longer
necessary Giant acquisitions in softdepend processing code.

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)


167719 19-Mar-2007 brian

When we write extended attributes, assert that the inode hasn't
already been deleted. The assertion is important to show that
we won't end up accounting for extended attribute blocks (using
fs_pendingblocks) in our subsequent call to fs_alloc().

Agreed verbally by: mckusick

MFC after: 3 weeks


167543 14-Mar-2007 kib

Implement fine-grained locking for UFS quotas.

Each struct dquot gets dq_lock mutex to protect dq_flags and to interlock
with DQ_LOCK. qhash, dqfreelist and dq.dq_cnt are protected by global
dqhlock mutex.

i_dquot array for inode is protected by lockmgr' vnode lock, corresponding
assert added to the dqget(). Access to struct ufsmount quota-related fields
(um_quotas and um_qflags) is protected by um_lock.

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)

This work would not have been possible without the enormous amount of help given by
Tor Egge and Peter Holm. Tor reviewed each version of patch, pointed out
numerous errors and provided invaluable suggestions. Peter did tireless
testing of the patch as it was developed.


167542 14-Mar-2007 kib

Call getinoquota() before allocating new block for the directory to properly
account for block allocation.

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)


167541 14-Mar-2007 kib

Remove unneeded getinoquota() call in the ufs_access().

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)


167497 13-Mar-2007 tegge

Make insmntque() externally visible and allow it to fail (e.g. during
late stages of unmount). On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib


167259 06-Mar-2007 mckusick

Move macros describing extended attributes in UFS from
<sys/extattr.h> to <ufs/ufs/extattr.h>. Move description
of extended attributes in UFS from man9/extattr.9 to
man5/fs.5.

Note that restore will not compile until <sys/extattr.h>
and <ufs/ufs/extattr.h> have been updated.

Suggested by: Robert Watson


167155 01-Mar-2007 pjd

Fix build breakage.


167154 01-Mar-2007 pjd

Change:
"... try to use VADMIN in preference to VADMIN ..."
To:
"... try to use VADMIN in preference to VWRITE ..."


167152 01-Mar-2007 pjd

Rename PRIV_VFS_CLEARSUGID to PRIV_VFS_RETAINSUGID, which seems to better
describe the privilege.

OK'ed by: rwatson


167151 01-Mar-2007 pjd

Avoid checking for privileges if there is no need to.

Discussed with: rwatson


166924 23-Feb-2007 brian

Account for di_blocks allocations when IN_SPACECOUNTED is set in an
inode's i_flag.

It's possible that after ufs_inactive() calls softdep_releasefile(),
i_nlink stays >0 for a considerable amount of time (> 60 seconds here).
During this period, any ffs allocation routines that alter di_blocks
must also account for the blocks in the filesystem's fs_pendingblocks
value.

This change fixes an eventual df/du discrepancy that will happen as
the result of fs_pendingblocks being reduced to <0.

The only manifestation of this that people may recognise is the
following message on boot:

/somefs: update error: blocks -N files M

at which point the negative pending block count is adjusted to zero.

Reviewed by: tegge
MFC after: 3 weeks


166864 21-Feb-2007 mckusick

The functions that set and delete external attributes must check
that the filesystem is not mounted read-only before proceeding.

Reported by: Ryan Beasley <ryanb@FreeBSD.org>
MFC after: 1 week


166832 19-Feb-2007 rwatson

Rename three quota privileges from the UFS privilege namespace to the
VFS privilege namespace: exceedquota, getquota, and setquota. Leave
UFS-specific quota configuration privileges in the UFS name space.

This renumbers VFS and UFS privileges, so requires rebuilding modules
if you are using security policies aware of privilege identifiers.
This is likely no one at this point since none of the committed MAC
policies use the privilege checks.


166831 19-Feb-2007 rwatson

Limit quota privileges in jail to PRIV_UFS_GETQUOTA and
PRIV_UFS_SETQUOTA.


166799 17-Feb-2007 mckusick

This README file is obsolete. The cited problems were fixed long ago
and the code is installed by default so no longer requires action by
the administrator to be included.


166774 15-Feb-2007 pjd

Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs


166743 15-Feb-2007 kib

Style(9).


166564 08-Feb-2007 kib

Remove unneeded acquisition of the mount interlock around reading of
mnt_kern_flags in ufs_itimes().

Suggested by: ssouhlal
Confirmed by: tegge
MFC after: 2 weeks


166506 04-Feb-2007 tegge

Call pbgetvp() and pbrelvp() instead of setting b_vp directly.

PR: kern/108151


166487 04-Feb-2007 mpp

If quotacheck or edquota reset the block or inode grace time for
a user or group, when the kernel first sees this, it will update
the grace time value. However, it never flags the quota as modified
and the updated value never makes it to the quota data file unless
the user actually makes some other change that would write the
data out.

Fixed to flag the quota as modified if the soft limit has actually
been reached and should be now enforced.


166381 01-Feb-2007 mpp

Prevent quotactl calls that pass in an id of -1 from incorrectly
using the callers UID instead of the GID when performing group
operations. This could allow users to determine group quota
information for groups they are not a member of in some cases.

Rename the "uid" parameter in ufs_quotactl to "id" to better show
that it is used for more than just the uid, and to be more in line
with the naming conventions in the other quota routines.

PR: kern/33940
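
The id selection described in r166381 above can be sketched as follows. This is a hedged illustration, not the ufs_quotactl() code: the enum, the credential structure and the function name are hypothetical, and only the "-1 means the caller's own id" rule from the commit message is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical quota types and credential fields for this sketch. */
enum sk_quota_type { SK_USRQUOTA, SK_GRPQUOTA };

struct sk_cred {
	unsigned int uid;
	unsigned int gid;
};

/*
 * Resolve the id a quota request applies to.  An id of -1 means
 * "the caller's own id", which has to be the GID for group quota
 * operations and the UID for user quota operations.
 */
unsigned int
sk_resolve_quota_id(enum sk_quota_type type, long id,
    const struct sk_cred *cred)
{
	assert(cred != NULL);
	if (id != -1)
		return ((unsigned int)id);
	return (type == SK_GRPQUOTA ? cred->gid : cred->uid);
}
```

The bug fixed by the commit corresponds to returning cred->uid in both cases.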


166380 01-Feb-2007 mpp

Disallow negative UIDs when processing quotactl options.


166193 23-Jan-2007 kib

Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquisition.

Avoid dealing with buffers in bdwrite() that are from the other side of
the snaplock divisor in the lock order than the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)


166146 20-Jan-2007 delphij

Fix build. chkdquot() should not return anything.


166142 20-Jan-2007 mpp

Quota system cleanup.

1) Do not do quota accounting for the actual quota data files
or for file system snapshot files ("system" files). This
prevents a deadlock described in PR kern/30958 if the kernel
ever has to grow the quota file. Snapshot files were already
exempt from the quota checks, but this change generalized the check.
2) Fix a cast that caused extremely large uids/gids to incorrectly
write the quota information to the data file at a truncated
value for a uint_t32 id value. The incorrect cast caused quota
files in this case to be around 4GB in size, with the correct cast
they can now be 131GB in size. Also related to PR kern/30958.
3) Check for what appear to be negative UIDs/GIDs and not account
for them. This prevents the quota files from becoming 131GB in
size and causing quotacheck to run forever at bootup. This could
also cause the kernel to try and expand the quota file, which might
deadlock due to the issue in #1. kern/30958 and kern/38156
(and some much older closed PR's).
4) With the deadlock problems gone, the kernel can now expand the
size of the quota database files if it needs to.
5) Pass in the i-node count change value to chkiq and chkiqchg as an
int, like it used to be before the common routine was split up
into 2 different routines to increase / decrease the i-node in-use
count. Prevents an underflow on the i-node count. Related
to PR kern/89247.
6) Prevent the block usage from growing slowly if a file system is
full and the write was denied due to that fact. PR kern/89247.

Some of these changes require an updated quotacheck to prevent
the creation of huge (131GB) quota data files (item #3).

#1/#4 probably fixes a lot of the random hangs when quotas are enabled,
possibly some of the jail hangs.


166052 16-Jan-2007 mpp

Fix a spelling error. heirarchy -> hierarchy.

Obtained from: OpenBSD


166051 16-Jan-2007 mpp

Fix a spelling error in some comments. heirarchy -> hierarchy.

Obtained from: OpenBSD


165890 08-Jan-2007 rwatson

Canonicalize copyright: use a date range rather than comma-delimited
list.

MFC after: 3 days


164248 13-Nov-2006 kmacy

Change vop_lock handling to allow tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)


164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


163874 01-Nov-2006 kib

Acquire Giant in the softdep_flush for clear_remove() and clear_inodedeps()
processing when QUOTA is set.

Reported and tested by: Peter Holm
Reviewed by: tegge
MFC after: 3 days


163841 31-Oct-2006 pjd

Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck about which cylinder groups to search for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by: home.pl


163606 22-Oct-2006 rwatson

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


163194 10-Oct-2006 kib

Do not translate the IN_ACCESS inode flag into the IN_MODIFIED while filesystem
is suspending/suspended. Doing so may result in deadlock. Instead, set the
(new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted.

Change the locking protocol in order to set the IN_ACCESS and timestamps
without upgrading shared vnode lock to exclusive (see comments in the
inode.h). Before that, inode was modified while holding only shared
lock.

Tested by: Peter Holm
Reviewed by: tegge, bde
Approved by: pjd (mentor)
MFC after: 3 weeks


162942 02-Oct-2006 tegge

Correct check for when IO_SYNC should be set for filesystem
not using softupdates when truncating a directory to zero length.

Discussed with: bde


162654 26-Sep-2006 tegge

Protect change to bo_flag by holding the bufobj mutex.


162653 26-Sep-2006 tegge

Reduce fluctuations of mnt_flag to allow unlocked readers to get a
slightly more consistent view.


162652 26-Sep-2006 tegge

Don't restore MNT_QUOTA bit in mnt_flag after snapshot creation,
closing a race between nmount() and quotactl().


162650 26-Sep-2006 tegge

Increase mnt_noasync once in softdep_mount() to disallow async io,
closing a window where a file system using softupdates could be async
for a short while if both MNT_UPDATE and MNT_ASYNC were passed as flags
to nmount(). Add MNTK_SOFTDEP flag to ensure that softdep_mount()
doesn't increase mnt_noasync multiple times.


162649 26-Sep-2006 tegge

Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
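
The counter-plus-derived-flag scheme described in r162649 above can be sketched as follows. This is a minimal model, not the kernel's implementation: the sk_ names and the flag bit values are hypothetical, and the locking the kernel needs around these fields is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; not the kernel's definitions. */
#define SK_MNT_ASYNC	0x1	/* what the administrator asked for */
#define SK_MNTK_ASYNC	0x1	/* whether async I/O is allowed right now */

struct sk_mount {
	int mnt_flag;
	int mnt_kern_flag;
	int mnt_noasync;	/* number of callers currently forbidding async I/O */
};

/* Recompute the derived flag from the requested flag and the counter. */
void
sk_update_async(struct sk_mount *mp)
{
	assert(mp != NULL);
	if ((mp->mnt_flag & SK_MNT_ASYNC) != 0 && mp->mnt_noasync == 0)
		mp->mnt_kern_flag |= SK_MNTK_ASYNC;
	else
		mp->mnt_kern_flag &= ~SK_MNTK_ASYNC;
}

/* Temporarily forbid async I/O; calls may be nested or interleaved. */
void
sk_noasync_enter(struct sk_mount *mp)
{
	mp->mnt_noasync++;
	sk_update_async(mp);
}

/* Undo one sk_noasync_enter(); MNT_ASYNC itself is never touched. */
void
sk_noasync_exit(struct sk_mount *mp)
{
	assert(mp->mnt_noasync > 0);
	mp->mnt_noasync--;
	sk_update_async(mp);
}

/* I/O paths test the derived flag instead of MNT_ASYNC. */
int
sk_may_async(const struct sk_mount *mp)
{
	return ((mp->mnt_kern_flag & SK_MNTK_ASYNC) != 0);
}
```

Because callers only adjust the counter, interleaved enter/exit pairs cannot lose the administrator's MNT_ASYNC setting.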


162647 26-Sep-2006 tegge

Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().


162460 20-Sep-2006 kib

Fix the glitch introduced in rev. 1.93. In softdep_sync_metadata(),
switch by worklist type contains two for() loops, for D_INDIRDEP and
D_PAGEDEP. On error, these loops are exited by break, where the switch
actually should be left. Use goto instead of break to reach the error
handling code.

Reported by: Peter Holm
Reviewed by: tegge
Approved by: pjd (mentor)
MFC after: 2 weeks
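
The control-flow problem described in r162460 above can be sketched as follows. This is a standalone illustration, not the softdep_sync_metadata() code: the enum, the function and the error value are hypothetical. It shows that a break inside a for loop nested in a switch leaves only the loop, so a goto is needed to skip the code that follows the switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical worklist item kinds for this sketch. */
enum sk_kind { SK_INDIRDEP, SK_OTHER };

/*
 * Walk nitems entries; entry number fail_at (if any) fails.
 * *after_switch is set to 1 only if the code after the switch ran.
 */
int
sk_process(enum sk_kind kind, int nitems, int fail_at, int *after_switch)
{
	int error = 0;
	int i;

	assert(after_switch != NULL);
	*after_switch = 0;
	switch (kind) {
	case SK_INDIRDEP:
		for (i = 0; i < nitems; i++) {
			if (i == fail_at) {
				error = 5;	/* arbitrary error value */
				/*
				 * A plain break here would leave only the for
				 * loop and execution would continue after the
				 * switch with the error still pending.
				 */
				goto out;
			}
		}
		break;
	case SK_OTHER:
		break;
	}
	*after_switch = 1;	/* work that must not run after an error */
out:
	return (error);
}
```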


162383 17-Sep-2006 rwatson

Declare security and security.bsd sysctl hierarchies in sysctl.h along
with other commonly used sysctl name spaces, rather than declaring them
all over the place.

MFC after: 1 month
Sponsored by: nCircle Network Security, Inc.


161515 21-Aug-2006 kib

While checking for update of snapshot file in the ffs_copyonwrite,
first filter out metadata update. Otherwise, devfs vnode could be
erroneously interpreted as ufs one, causing further check of i_flags
to use random memory.

PR: kern/100365
Debugged and fix described by: tegge
Approved by: pjd (mentor)
MFC after: 2 weeks


161473 20-Aug-2006 pjd

Correct typo in comment.


160859 31-Jul-2006 obrien

Rather than print out a nice error message giving details sufficient to fix
a 'ufs_dirbad' and then panicking (making it very hard to see the details),
put them in the panic message itself.


160462 18-Jul-2006 stefanf

Drop two unnecessary casts.


160269 11-Jul-2006 daichi

The ufs_lookup.c has a critical bug around the whiteout
process. UFS must check a whiteout name when it uses the
whiteout, but the current implementation does not check
the whiteout name, so sometimes UFS writes over a wrong
whiteout. UFS *MUST* check the whiteout name to use a
correct whiteout. This bug leads to a unionfs panic.
This commit fixes this trouble.

Submitted by: Masanori Ozawa <ozawa@ongs.co.jp> (unionfs developer)
Reviewed by: tegge & rodrigc (mentor)
Approved by: rodrigc (mentor)
MFC after: 2 weeks


160205 09-Jul-2006 pjd

Declare UFS module version.


160204 09-Jul-2006 pjd

Change fs->fs_fsmnt to mp->mnt_stat.f_mntonname in warnings about missing
MAC and ACLs support in the kernel. If it is a first mount, fs->fs_fsmnt
is empty.

MFC after: 1 week


159209 03-Jun-2006 rodrigc

Check the sectorsize of the underlying disk before trying to
bread() the UFS superblock. Should eliminate crashes when trying
to do: mount -t ufs on an audio CD.

PR: kern/85893
Reported by: Russell Francis <rfrancis at ev dot net>
MFC after: 1 week


159109 31-May-2006 maxim

o Rearrange and remove incorrect comments.

Requested by: bde


159102 31-May-2006 maxim

o According to POSIX, the result of ftruncate(2) is unspecified
for file types other than VREG, VDIR and shared memory objects.
We already handle VREG, VLNK and VDIR cases. Silently ignore
truncate requests for all the rest. Adjust comments.

PR: kern/98064
Submitted by: bde
Security: local DoS
Regress. test: regression/fifo/fifo_misc
MFC after: 2 weeks


158952 26-May-2006 rodrigc

Remove "update" from ffs_opts. It has been moved to global_opts
in vfs_mount.c.


158924 26-May-2006 rodrigc

Remove calls to vfs_export() for exporting a filesystem for NFS mounting
from individual filesystems. Call it instead in vfs_mount.c,
after we call VFS_MOUNT() for a specific filesystem.


158867 24-May-2006 rodrigc

Take errmsg out of ffs_opts. It is already part of global_opts
in vfs_mount.c.


158802 21-May-2006 maxim

o Fix a comment: ufs2_dinode.di_blocks counts blocks not bytes actually held.


158801 21-May-2006 maxim

o Fix a comment: directory whiteout type is DT_WHT not DT_W.


158659 16-May-2006 trhodes

Provide a less cryptic panic message in place of just "found inode."


158636 16-May-2006 tegge

Read block hints list from last snapshot on the active snapshot list.


158634 15-May-2006 tegge

Copy last block on file system again after file system has been suspended.

Obtained from: NetBSD


158633 15-May-2006 tegge

Don't leak a locked buffer if last block on file system cannot be read.


158632 15-May-2006 tegge

Errors detected while file system is suspended should not trigger an
assertion failure.


158527 13-May-2006 tegge

Expunge traces of unlinked snapshot files when making a new snapshot.


158382 09-May-2006 tegge

Bring the call to softdep_releasefile() within the region protected by
vn_start_secondary_write() since it might cause file system write activity
(e.g. ffs_snapremove()).


158338 06-May-2006 tegge

ffs_syncvnode() might skip some of the blocks due to them being locked,
assuming them to be inflight write buffers. This is not always the case.
bufdaemon might hold the buffer lock and give up writing the buffer due to it
having dependencies, the file system being suspended or the vnode lock being
held by another thread. When bufdaemon decides to write the buffer there is
still a window before bufobj_wref() has been called, allowing other threads to
believe that the vnode has no dirty buffers or inflight writes.

Try harder to flush first block of new subdirectory to get rid of MKDIR_BODY
dependency.


158325 05-May-2006 tegge

Return error if vnode was reclaimed while it was temporarily unlocked.
Add missing calls to vn_finished_write() in error handling.


158322 05-May-2006 tegge

Turn off disk quotas for snapshot files.


158321 05-May-2006 tegge

Avoid locking overhead when snapshots are disabled.


158308 05-May-2006 pjd

- Set bio_done directly to NULL to indicate that we want to wait for the bio.
- Use biowait() instead of copying the code.

MFC after: 1 month


158262 03-May-2006 tegge

Detect the snapshot file being prematurely unlinked.


158261 03-May-2006 tegge

Temporarily undo the cluster's contribution to global runningbufspace while
handling copy on write for the buffers taking part in the cluster.


158260 03-May-2006 tegge

A side effect of calling runningbufwakeup() is that bp->b_runningbufspace is
cleared. Save old value and restore bp->b_runningbufspace before returning
from ffs_copyonwrite().
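
A minimal sketch of the save-and-restore pattern described above. The
sketch_* names are hypothetical stand-ins, not the kernel's code, and the
wakeup function models only the one side effect the message mentions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buf; only the relevant field is kept. */
struct sketch_buf {
	long b_runningbufspace;
};

/* Models only the side effect named in the commit: the field is cleared. */
static void
sketch_runningbufwakeup(struct sketch_buf *bp)
{
	bp->b_runningbufspace = 0;
}

/* Save the old value, let the wakeup clear it, restore before returning. */
static void
sketch_copyonwrite(struct sketch_buf *bp)
{
	long saved;

	assert(bp != NULL);
	saved = bp->b_runningbufspace;
	sketch_runningbufwakeup(bp);
	/* ... copy-on-write work would happen here ... */
	bp->b_runningbufspace = saved;
}
```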


158259 02-May-2006 tegge

Close a race when VOP_LOCK() on a snapshot file is attempted at the
same time as it is changed back into a normal file. The locker would
get the shared "snaplk" lock which would no longer be the correct lock
for the vnode.


158100 28-Apr-2006 scottl

Fix a typo.


158095 28-Apr-2006 jeff

- Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child
buffers to go on the buf daemon's DIRTYGIANT queue.
- Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
runs in the context of the buf daemon and may require Giant.


157955 22-Apr-2006 trhodes

Revert previous to this file before an actual request is made.


157919 21-Apr-2006 trhodes

Remove what I believe are two useless ifdefs. If a user or administrator
enables multilabel, or any option for that matter, most likely they have
a reason. This will allow users to see that multilabel is enabled via an
issued "mount" command and remove an annoying warning - printed only when
a MAC kernel is not installed - on boot up.

Discussed with: green, brueffer, Samy Al Bahra.
Probably ran past: csjp (though I can't remember).


157805 17-Apr-2006 kensmith

Fix panic() message to give the right function name.


157447 03-Apr-2006 tegge

Eliminate softdep_flush() livelock by accounting for number of worklist items
marked as being in progress.


157325 31-Mar-2006 jeff

- Release the references acquired by VOP_GETWRITEMOUNT and vfs_getvfs().

Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.


156899 19-Mar-2006 tegge

Allow compilation when not using softupdates.


156898 19-Mar-2006 tegge

Let snapshots make a copy of old contents for all buffers taking part in a
cluster instead of just the first buffer.

Delay buf_start() calls until snapshots have a copy of old content.

PR: kern/93942


156897 19-Mar-2006 tegge

Add kludge to avoid deadlock when unlinking snapshot.


156896 19-Mar-2006 tegge

Reduce probability of unmount failing after having unmounted snapshots.


156895 19-Mar-2006 tegge

Ensure that vnode for directory isn't reclaimed before ffs_snapshot() has
completed expunging unlinked files. It could come back at another memory
location causing a lock order reversal.


156589 12-Mar-2006 jeff

- Remove the call to softdep_waitidle after suspending the filesystem.
This does not do what I wanted as all dirty buffers must be flushed
by the call to ffs_sync and any remaining dependency work would mean
that this failed.

Pointed out by: tegge


156587 12-Mar-2006 jeff

- Remove the call to softdep_waitidle after suspending the filesystem.
This does not do what I wanted as all dirty buffers must be flushed
by the call to ffs_sync and any remaining dependency work would mean
that this failed.

Pointed out by: tegge


156560 11-Mar-2006 tegge

Block secondary writes while expunging active unlinked files.

Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.

Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).

Perform deferred inactive handling after the file system is resumed.


156521 10-Mar-2006 tegge

Remove unneeded (and broken) usage of MNT_REF()/MNT_REL().


156451 08-Mar-2006 tegge

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


156418 08-Mar-2006 tegge

Don't set IN_CHANGE and IN_UPDATE on inodes for potentially suspended
file systems. This could cause deadlocks when creating snapshots.

Reviewed by: jeff


156225 02-Mar-2006 tegge

Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.


156206 02-Mar-2006 jeff

- Acquire lk in softdep_slowdown so that it's owned when we call
softdep_speedup().
- Assert that lk is held in softdep_speedup() rather than acquiring it.
This avoids a potential lock recursion.


156203 02-Mar-2006 jeff

- Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris


155897 22-Feb-2006 jeff

- Using LK_NOWAIT in qsync() can get us into infinite loop situations that
lead to deadlocks. Remove it.

MFC After: 1 week


155572 12-Feb-2006 rwatson

In quotaoff(), lock the vnode instead of asserting it when manipulating
v_vflags.

MFC after: 1 week
Submitted by: Antoine Brodin <antoine at brodin at laposte dot net>


155555 11-Feb-2006 rwatson

Instead of asserting the vnode lock before manipulating v_vflag, acquire
it and drop it afterwards.

Found by: kris
MFC after: 1 week


155160 01-Feb-2006 jeff

- Reorder calls to vrele() after calls to vput() when the vrele is a
directory. vrele() may lock the passed vnode, which in these cases would
give an invalid lock order of child -> parent. These situations are
deadlock prone although do not typically deadlock because the vrele
is typically not releasing the last reference to the vnode. Users of
vrele must consider it as a call to vn_lock() and order it appropriately.

MFC After: 1 week
Sponsored by: Isilon Systems, Inc.
Tested by: kkenn


154152 09-Jan-2006 tegge

Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman


154150 09-Jan-2006 tegge

If the lock passed to getdirtybuf() is the softdep lock then the background
write completed wakeup could be missed. Close the race by grabbing the lock
normally used for protection of bp->b_xflags.

Reviewed by: truckman


154149 09-Jan-2006 tegge

Broaden scope of softdep_worklist_busy rwlock protection of softdep processing
to avoid some dependencies being missed by softdep_flushworklist().

Reviewed by: truckman


154065 06-Jan-2006 imp

New option: NO_FFS_SNAPSHOT. I did this in p4 about the same time
that NetBSD implemented it independently of them (don't know which one
was actually first). This saves about 24k for those times you don't
need snapshot support (like when running off a ram disk, or in an
embedded environment where size matters).


153689 23-Dec-2005 delphij

Typo.


153400 14-Dec-2005 des

Eradicate caddr_t from the VFS API.


152771 24-Nov-2005 rodrigc

Fix parsing of atime, clusterr, clusterw, exec, suid, symfollow
mount options.

Noticed by: Amir Shalem < amir at boom dot org dot il>


152639 20-Nov-2005 rodrigc

If export mount flag is not passed in, set default parameters
for export structure and pass that to vfs_export().
Currently in userland mount(8), an export structure is unconditionally
passed in, only for UFS. This is an attempt to move that UFS-specific
behavior out of mount(8) and into the UFS filesystem code.


152622 19-Nov-2005 rodrigc

Add more options to ffs_opts, so that vfs_filteropts() will not
complain when we pass these options to a UFS filesystem as strings
via nmount(): noexec, nosuid, nosymfollow, sync, suiddir


152567 18-Nov-2005 rodrigc

- Add parsing for the following existing UFS/FFS mount options in the nmount()
callpath via vfs_getopt(), and set the appropriate MNT_* flag:
-> acls, async, force, multilabel, noasync, noatime,
-> noclusterr, noclusterw, snapshot, update

- Allow errmsg as a valid mount option via vfs_getopt(),
so we can later add a hook to propagate mount errors back
to userspace via vfs_mount_error().


152163 07-Nov-2005 delphij

Slightly reorganize to reduce duplicated code.

Reviewed by: rwatson


151906 31-Oct-2005 ps

Rate limit filesystem full and out of inodes messages to once a
second.
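
The message does not say how the limit is implemented. As a hypothetical
user-space sketch (all names below are assumptions), a once-per-second
limiter can be as small as:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-message state; not a kernel structure. */
struct msg_ratelimit {
	time_t last;	/* second in which the message was last emitted */
	bool   primed;	/* false until the first message has been emitted */
};

/*
 * Return true if a message may be emitted at time `now`, allowing at
 * most one message per distinct second.
 */
static bool
msg_ratelimit_ok(struct msg_ratelimit *rl, time_t now)
{
	assert(rl != NULL);
	if (rl->primed && now == rl->last)
		return (false);
	rl->primed = true;
	rl->last = now;
	return (true);
}
```

A caller would keep one such state per message type and print only when
the function returns true.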


151897 31-Oct-2005 rwatson

Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.


151657 25-Oct-2005 delphij

Remove an unneeded "a" from comment.


151528 21-Oct-2005 njl

Adjust maxfilesize for UFS1 and old 4.4 FFS. For UFS1, increase the limit
to (max block - 1) * bsize. For DEV_BSIZE, this doubles the limit from
0.5 TB to 1 TB. For the old 4.4 FFS case, decrease the limit from 0.5 TB
to 2 GB - 1. Older systems had a 32 bit off_t so they couldn't access the
larger files anyway.

Collaboration with: bde
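
The figures above can be checked with a little arithmetic. This sketch
assumes a block count of 2^31 (chosen here because it reproduces the 1 TB
figure; the message itself does not state the count) and the 512-byte
DEV_BSIZE. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DEV_BSIZE 512u	/* DEV_BSIZE is 512 bytes on FreeBSD */

/* Bytes addressable by `nblocks` blocks of `bsize` bytes each. */
static uint64_t
addressable_bytes(uint64_t nblocks, uint32_t bsize)
{
	assert(bsize != 0 && nblocks <= UINT64_MAX / bsize);
	return (nblocks * bsize);
}

/* Largest offset a signed 32-bit off_t can represent: 2 GB - 1. */
static uint64_t
max_off32(void)
{
	return ((uint64_t)INT32_MAX);
}
```

With these assumptions, 2^31 blocks of 512 bytes give 2^40 bytes (1 TB),
and half that block count gives 2^39 bytes (0.5 TB).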


151390 16-Oct-2005 truckman

Correct the type of the temporary variable used by ufs_lookup.c:1.78
to fix the race condition in the ufs_lookup() ISDOTDOT code.

Noticed by: bde
MFC after: 12 days


151347 14-Oct-2005 truckman

Close a race in the ufs_lookup() code that handles the ISDOTDOT
case by saving the value of dp->i_ino before unlocking the vnode
for the current directory and passing the saved value to VFS_VGET().

Without this change, another thread can overwrite dp->i_ino after
the current directory is unlocked, causing ufs_lookup() to lock
and return the wrong vnode in place of the vnode for its parent
directory. A deadlock can occur if dp->i_ino was changed to a
subdirectory of the current directory because the root to leaf vnode
lock ordering will be violated. A vnode lock can be leaked if
dp->i_ino was changed to point to the current directory, which
causes the current vnode lock for the current directory to be
recursed, which confuses lookup() into calling vrele() when it
should be calling vput().

The probability of this bug being triggered seems to be quite low
unless the sysctl variable debug.vfscache is set to 0.

Reviewed by: jhb
MFC after: 2 weeks
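
The fix follows a common pattern: copy a field while the lock is held and
use only the copy afterwards. A minimal user-space sketch, in which every
sketch_* name is hypothetical and the unlock callback stands in for
releasing the directory vnode lock:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-core directory inode. */
struct sketch_inode {
	uint32_t i_ino;
};

/* Callback standing in for "unlock the directory vnode". */
typedef void (*sketch_unlock_fn)(struct sketch_inode *);

/*
 * Return the inode number to hand to the vget step. The value is
 * copied while the directory is still locked, so a change made by
 * another thread after the unlock cannot affect the result.
 */
static uint32_t
sketch_dotdot_ino(struct sketch_inode *dp, sketch_unlock_fn unlock)
{
	uint32_t saved_ino;

	assert(dp != NULL && unlock != NULL);
	saved_ino = dp->i_ino;	/* snapshot taken under the lock */
	unlock(dp);		/* dp->i_ino may change from here on */
	return (saved_ino);	/* dp->i_ino is never re-read */
}

/* Simulates another thread overwriting i_ino right after the unlock. */
static void
sketch_unlock_and_clobber(struct sketch_inode *dp)
{
	dp->i_ino = 999;
}
```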


151258 12-Oct-2005 rwatson

When performing a VOP_LOOKUP() as part of UFS1 extended attribute
auto-start, set cnp.cn_lkflags to LK_EXCLUSIVE. This flag must now
be set so that lockmgr knows what kind of lock to acquire, and it
will panic if not specified. This resulted in a panic when using
extended attributes on UFS1 as of locking work present in the 6.x
branch.

This is a RELENG_6_0 merge candidate.

Reported by: lofi
MFC after: 3 days


151252 12-Oct-2005 dds

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


151218 10-Oct-2005 tegge

Avoid unintended VMIO on directories and symlinks due to leftover object
not having been destroyed.


151184 09-Oct-2005 tegge

Adjust totread argument passed to cluster_read() to account for offset not
being block aligned.


151181 09-Oct-2005 tegge

Don't pretend that a failed sync write was successful.


151180 09-Oct-2005 tegge

Reduce probability for a deadlock that can occur when a snapshot inode is
updated by a process holding the snapshot lock. Another process updating a
different inode in the same inodeblock will do copy on write checks and lock in
the opposite direction.

The snapshot code forces a copy on write of these blocks manually (cf. start of
expunge_ufs[12]) and these inode blocks are later put on snapblklist.

This partial fix is to 'drain' the relevant ffs_copyonwrite() operation after
installing new snapblklist. This is not a 100% solution since a failed block
allocation can cause implicit fsync() which might deadlock before the new
snapblklist has been installed.


151179 09-Oct-2005 tegge

Eliminate a deadlock that can occur when a dirty block belonging to a snapshot
file is flushed by a process not holding snaplk (e.g. bufdaemon). Another
process might hold snaplk and try to access the block due to ffs_copyonwrite
processing.


151178 09-Oct-2005 tegge

Eliminate a deadlock that can occur during the cgaccount() processing due to
the cg map buffer being held when writing indirect blocks. The process ends up
in ffs_copyonwrite(), attempting to get snaplk while holding the cg map buffer
lock.

Another process might be in ffs_copyonwrite(), trying to allocate a new block
for a copy. It would hold snaplk while trying to get the cg map buffer lock.

Release the cg map buffer early and use the copy for most of the cgaccount
processing to avoid this deadlock.


151177 09-Oct-2005 tegge

Reduce the probability of low block numbers passed to ffs_snapblkfree() by
skipping the call from ffs_snapremove() if the block number is zero.

Simplify snapshot locking in ffs_copyonwrite() and ffs_snapblkfree() by using
the same locking protocol for low block numbers as for larger block numbers.
This removes a lock leak that could happen if vn_lock() succeeded after
lockmgr() failed in ffs_snapblkfree().

Check if snapshot is gone before retrying a lock in ffs_copyonwrite().


151176 09-Oct-2005 tegge

Reinitialize v_type and v_op fields in case vnode has been reused without
reclamation. If the vnode previously was a fifo then v_op would point to
ffs_fifoops[12] instead of the expected ffs_vnodeops[12], causing a panic at
the end of ffsext_strategy.


150891 03-Oct-2005 truckman

Initialize the inode i_flag field in ffs_valloc() to clean up any
stale flag bits left over from before the inode was recycled.

Without this change, a leftover IN_SPACECOUNTED flag could prevent
softdep_freefile() and softdep_releasefile() from incrementing
fs_pendinginodes. Because handle_workitem_freefile() unconditionally
decrements fs_pendinginodes, a negative value could be reported at
file system unmount time with a message like:
unmount pending error: blocks 0 files -3
The pending block count in fs_pendingblocks could also be negative
for similar reasons. These errors can cause the data returned by
statfs() to be slightly incorrect. Some other cleanup code in
softdep_releasefile() could also be incorrectly bypassed.

MFC after: 3 days


150791 01-Oct-2005 truckman

Correct previous commit to fix the sense of the TDP_NORUNNINGBUF
check in ffs_copyonwrite() that is a precondition for calling
waitrunningbufspace().

Pointed out by: tegge
Pointy hat to: truckman
MFC after: 3 days


150760 30-Sep-2005 truckman

Un-staticize waitrunningbufspace() and call it before returning from
ffs_copyonwrite() if any async writes were launched.

Restore the thread's previous TDP_NORUNNINGBUF state before returning
from ffs_copyonwrite().


150741 30-Sep-2005 truckman

Un-staticize runningbufwakeup() and staticize updateproc.

Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>


150733 29-Sep-2005 truckman

After a rmdir()ed directory has been truncated, force an update of
the directory's inode after queuing the dirrem that will decrement
the parent directory's link count. This will force the update of
the parent directory's actual link count to actually be scheduled. Without
this change the parent directory's actual link count would not be
updated until ufs_inactive() cleared the inode of the newly removed
directory, which might be deferred indefinitely. ufs_inactive()
will not be called as long as any process holds a reference to the
removed directory, and ufs_inactive() will not clear the inode if
the link count is non-zero, which could be the result of an earlier
system crash.

If a background fsck is run before the update of the parent directory's
actual link count has been performed, or at least scheduled by
putting the dirrem on the leaf directory's inodedep id_bufwait list,
fsck will corrupt the file system by decrementing the parent
directory's effective link count, which was previously correct
because it already took the removal of the leaf directory into
account, and setting the actual link count to the same value as the
effective link count after the dangling, removed, leaf directory
has been removed. This happens because fsck acts based on the
actual link count, which will be too high when fsck creates the
file system snapshot that it references.

This change has the fortunate side effect of more quickly cleaning
up the large number of dirrem structures that linger for an extended
time after the removal of a large directory tree. It also fixes a
potential problem with the shutdown of the syncer thread timing out
if the system is rebooted immediately after removing a large directory
tree.

Submitted by: tegge
MFC after: 3 days


150663 28-Sep-2005 rwatson

Back out alpha/alpha/trap.c:1.124, osf1_ioctl.c:1.14, osf1_misc.c:1.57,
osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60,
svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81,
svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55,
svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10,
ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58,
unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133:

Now that Giant is acquired in uprintf() and tprintf(), the caller no
longer needs to acquire Giant unless it also holds another mutex that
would generate a lock order reversal when calling into these functions.
Specifically not backed out is the acquisition of Giant in nfs_socket.c
and rpcclnt.c, where local mutexes are held and would otherwise violate
the lock order with Giant.

This aligns this code more with the eventual locking of ttys.

Suggested by: bde


150634 27-Sep-2005 jhb

Use the refcount API to manage the reference count for user credentials
rather than using pool mutexes.

Tested on: i386, alpha, sparc64


150492 23-Sep-2005 delphij

Restore a historical ufs_inactive behavior that has been changed
in rev. 1.40 of ufs_inode.c, which allows an inode being truncated
even when the filesystem itself is marked RDONLY. A subsequent
call of UFS_TRUNCATE (ffs_truncate) would panic the system as it
asserts that it can only be called when the filesystem is mounted
read-write (same changeset, rev. 1.74 of sys/ufs/ffs/ffs_inode.c).

Because ffs_mount() already takes care of sync'ing the filesystem
to disk before being downgraded to readonly, it appears to be more
desirable that we should not permit this sort of writes to disk.

This change would fix a panic that occurs when a corrupted filesystem
is mounted read-only and some file operations are performed on it.

MT6/5/4 candidate

Reviewed by: mckusick


150335 19-Sep-2005 rwatson

Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions. In most cases, this means adding an
acquisition of Giant immediately around the function. In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset. This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad idea, as is having
non-MPSAFE tty code.

MFC after: 1 week


150010 12-Sep-2005 tegge

Giant is no longer needed here.


149811 06-Sep-2005 csjp

Convert the primary ACL allocator from malloc(9) to using a UMA zone instead.
Also introduce an aclinit function which will be used to create the UMA zone
for use by file systems at system start up.

MFC after: 1 month
Discussed with: rwatson


149808 05-Sep-2005 tegge

Retain generation count when writing zeroes instead of an inode to disk.

Don't free a struct inodedep if another process is allocating saved inode
memory for the same struct inodedep in initiate_write_inodeblock_ufs[12]().

Handle disappearing dependencies in softdep_disk_io_initiation().

Reviewed by: mckusick


149713 02-Sep-2005 ssouhlal

ffs_mountfs() needs devvp to be locked, so lock it.

Glanced at by: phk
Tested by: pjd
MFC after: 3 days


149358 21-Aug-2005 ssouhlal

Set the mountpoint path in the superblock (fs_fsmnt) at mount-time
so that it appears in the various messages (not cleanly unmounted,
filesystem full, etc). This has been broken since rev 1.261.


149354 21-Aug-2005 tegge

Don't set the COMPLETE flag in an inodedep structure before the related
inode has been written.


149178 17-Aug-2005 iedowse

In the ufsdirhash_build() failure case for corrupted directories
or unreadable blocks, make sure to destroy the mutex we created.
Also fix an unrelated typo in a comment.

Found by: Peter Holm's stress tests
Reviewed by: dwmalone
MFC after: 3 days


148608 31-Jul-2005 ups

Delay freeing disk space for file system blocks until all dirty buffers
are safely released. This fixes softdep problems on truncation (deletion)
of files with dirty buffers.

Reviewed by: jeff@, mckusick@, ps@, tegge@
Tested by: glebius@, ps@
MFC after: 3 weeks


148200 20-Jul-2005 alc

Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks


147198 09-Jun-2005 ssouhlal

Allow EVFILT_VNODE events to work on every filesystem type, not just
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()

My benchmarks have not revealed any performance degradation.

Reviewed by: jeff, bde
Approved by: rwatson, jmg (kqueue changes), grehan (mentor)


146829 31-May-2005 kensmith

This patch addresses a standards violation issue. The standards say a
file's access time should be updated when it gets executed. A while
ago the mechanism used to exec was changed to use a more mmap based
mechanism and this behavior was broken as a side-effect of that.

A new vnode flag is added that gets set when the file gets executed,
and the VOP_SETATTR() vnode operation gets called. The underlying
filesystem is expected to handle it based on its own semantics, some
filesystems don't support access time at all. Those that do should
handle it in a way that does not block, does not generate I/O if possible,
etc. In particular vn_start_write() has not been called. The UFS code
handles it the same way as it would normally handle the access time if
a file was read - the IN_ACCESS flag gets set in the inode but no other
action happens at this point. The actual time update will happen later
during a sync (which handles all the necessary locking).

Got me into this: cperciva
Discussed with: a lot with bde, a little with kan
Showed patches to: phk, jeffr, standards@, arch@
Minor discussion on: arch@


146802 30-May-2005 jeff

- Don't set our bio op to be a READ when we've just completed a write. There
are subtle differences in the read and write completion path. Instead,
grab an extra write ref so the write path can drop it when we recursively
call bufdone(). I believe this may be the source of the wrong bufobj
panics.

Reported by: pho, kkenn


146356 18-May-2005 mckusick

Allow removal of empty directories with high link counts. These can
occur on a filesystem running with soft updates after a crash and
before a background fsck has been run. To prevent discrepancies
from arising in a background fsck that may already be running,
the directory is removed but its inode is not freed and is left
with the residual reference count. When encountered by the
background fsck it will be reclaimed.


145824 03-May-2005 jeff

- Don't restrict the softdep stats to DEBUG kernels, they cost nothing to
export. This was happening anyway since this file manually sets DEBUG.
- Add a sysctl for the number of items on the worklist.
- Use a more canonical loop restart in softdep_fsync_mountdev, it saves
some code at the expense of a goto and makes me worry less about
modifying a variable that should be private to the TAILQ_FOREACH_SAFE
macro.


145702 30-Apr-2005 jeff

- Use bdone() directly instead of calling it indirectly through
ffs_rawreaddone().

Sponsored by: Isilon Systems, Inc.


145138 16-Apr-2005 pjd

- Plug memory leak.
- Fix two style nits.

Found by: Coverity Prevent analysis tool
Reviewed by: rwatson
MFC after: 1 week


145006 13-Apr-2005 jeff

- Change all filesystems and vfs_cache to relock the dvp once the child is
locked in the ISDOTDOT case. See vfs_lookup.c r1.79 for details.

Sponsored by: Isilon Systems, Inc.


144659 05-Apr-2005 jeff

- Consistently call 'vp' vp rather than ovp sometimes in ffs_truncate().
Do the same for oip.

Pointed out by: glebius


144590 03-Apr-2005 jeff

- Use M_ZERO rather than explicitly calling bzero().
- Don't intermingle direct calls to lockmgr and indirect calls through
VOPs. This will be important in the future.
- Don't lock the devvp's interlock just to release it on the next line by
passing LK_INTERLOCK to lockmgr.
- Restructure ffs_snapshot_unmount so we don't call free() with the
devvp's interlock locked.


144586 03-Apr-2005 jeff

- In ffs_sync we need to pass LK_SLEEPFAIL in when we lock the vnode
because it may change identities while we're sleeping on the lock.
Otherwise we may bail out of ffs_sync() early due to an error from
deadfs.
- Collapse a VOP_UNLOCK, vrele into a single vput().


144585 03-Apr-2005 jeff

- Move the contents of softdep_disk_prewrite into ffs_geom_strategy to fix
two bugs.
- ffs_disk_prewrite was pulling the vp from the buf and checking for
COPYONWRITE, when really it wanted the vp from the bufobj that we're
writing to, which is the devvp. This led to us skipping the copy on
write to all file data, which significantly broke snapshots for the
last few months.
- When the SOFTUPDATES option was not included in the kernel config we
would also skip the copy on write check, which would effectively disable
snapshots.
- Remove an invalid mp_fixme().

Debugging tips from: mckusick
Reported by: iedowse, others
Discussed with: phk


144376 31-Mar-2005 jeff

- Fix botched LK_NOWAIT removal. I mistakenly thought this compiled as
part of GENERIC.


144375 31-Mar-2005 jeff

- FFS supports shared locks, clear LK_NOSHARE from our vnode locks.

Sponsored by: Isilon Systems, Inc.


144373 31-Mar-2005 jeff

- Set LK_NOSHARE for snapshot locks. snapshots require exclusive only
access.
- Remove the hack from ffs_lock() to implement LK_NOSHARE in a ffs
specific way.

Sponsored by: Isilon Systems, Inc.


144367 31-Mar-2005 jeff

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


144300 29-Mar-2005 jeff

- Remove wantparent, it is no longer necessary. An assert in vfs_lookup.c
prevents any callers from doing a modifying op without
LOCKPARENT or WANTPARENT. It wasn't even properly used in the CREATE
or DELETE cases.


144289 29-Mar-2005 jeff

- Upgrade a shared lock request to exclusive in ffs_vget() if we have
to create the vnode.

Sponsored by: Isilon Systems, Inc.


144288 29-Mar-2005 jeff

- Honor the cn_lkflags passed from namei() when locking the leaf.

Sponsored by: Isilon Systems, Inc.


144209 28-Mar-2005 jeff

- UFS no longer uses PDIRUNLOCK to track the parent state. Instead, we now
rely on ufs to always leave the parent locked except in the ISDOTDOT
case. Adjust asserts to deal with these changes.

Sponsored by: Isilon Systems, Inc.


144208 28-Mar-2005 jeff

- We no longer have to bother with PDIRUNLOCK, lookup() handles it for us.

Sponsored by: Isilon Systems, Inc.


144118 25-Mar-2005 das

When the softupdates worklist gets too long, threads that attempt to
add more work are forced to process two worklist items first.
However, processing an item may generate additional work, causing the
unlucky thread to recursively process the worklist. Add a per-thread
flag to detect this situation and avoid the recursion. This should
fix the stack overflows that could occur while removing large
directory trees.

Tested by: kris
Reviewed by: mckusick
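
A minimal userland sketch of the recursion guard described above. All
names (struct worklist, add_work, process_one, the two limits) are
hypothetical and are not taken from ffs_softdep.c; the sketch only
assumes that processing an item may queue follow-up work.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKLIST     4  /* hypothetical length limit */
#define ITEMS_TO_PROCESS 2  /* items a throttled thread must handle first */

struct worklist {
    size_t pending;    /* items currently queued */
    size_t depth;      /* current nesting of process_one() */
    size_t max_depth;  /* deepest nesting observed */
};

/* Per-thread flag: set while this thread is already draining the list. */
static _Thread_local bool processing_worklist;

static void add_work(struct worklist *wl);

/* Process one item; in this model every item spawns one follow-up item. */
static void process_one(struct worklist *wl)
{
    if (wl->pending == 0)
        return;
    wl->pending--;
    wl->depth++;
    if (wl->depth > wl->max_depth)
        wl->max_depth = wl->depth;
    add_work(wl);
    wl->depth--;
}

/* Queue an item; if the list is too long, help drain it first. */
static void add_work(struct worklist *wl)
{
    if (wl->pending >= MAX_WORKLIST && !processing_worklist) {
        processing_worklist = true;   /* nested calls skip the drain */
        for (int i = 0; i < ITEMS_TO_PROCESS; i++)
            process_one(wl);
        processing_worklist = false;
    }
    wl->pending++;
}
```

With the flag in place, a nested add_work() only queues its item, so
the nesting depth of process_one() stays at one regardless of how long
the list is.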


144057 24-Mar-2005 jeff

- Call VFS_ROOT() with LK_EXCLUSIVE.

Sponsored by: Isilon Systems, Inc.


144056 24-Mar-2005 jeff

- Update the ufs_root() prototype.
- Pass the ufs_root() flags argument to VFS_VGET() to allow callers to
specify shared locks.

Sponsored by: Isilon Systems, Inc.


143743 17-Mar-2005 jeff

- Lock the clearing of v_data in ufs_reclaim() to prevent a pagefault
in ffs_lock() when it accesses v_data without the vnlock.

Sponsored by: Isilon Systems, Inc.


143692 16-Mar-2005 phk

Add two arguments to the vfs_hash() KPI so that filesystems which do
not have unique hashes (NFS) can also use it.


143666 15-Mar-2005 phk

Don't hold a reference on the disk vnode for each inode.


143663 15-Mar-2005 phk

Improve the vfs_hash() API: vput() the unneeded vnode centrally to
avoid replicating the vput in all the filesystems.


143619 15-Mar-2005 phk

Simplify the vfs_hash calling convention.


143613 15-Mar-2005 jeff

- Destroy the vnode object earlier in VOP_RECLAIM as we need more of
the vnode valid before the vm flushes pages.
- Get rid of some extraneous uses of the vnode interlock.

Sponsored by: Isilon Systems, Inc.


143562 14-Mar-2005 phk

Use vfs_hash instead of home-rolled.


143504 13-Mar-2005 jeff

- It is not legal to access v_data without the vnode lock or interlock
held. Grab the vnode interlock if LK_INTERLOCK has not been passed in
so that we can inspect v_data in ffs_lock().

Sponsored by: Isilon Systems, Inc.


143503 13-Mar-2005 jeff

- The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem. Check that rather than VI_XLOCK.
- Shorten ffs_reload by one step. The old check for an inactive vnode
was slightly racy, and the code which deals with still active vnodes
is not much more expensive.

Sponsored by: Isilon Systems, Inc.


143502 13-Mar-2005 jeff

- The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem. Check that rather than VI_XLOCK.

Sponsored by: Isilon Systems, Inc.


143501 13-Mar-2005 jeff

- Fix an assert now that the XLOCK no longer exists.

Sponsored by: Isilon Systems, Inc.


143500 13-Mar-2005 jeff

- In ufs_mknod(), hold the lock across the call to vgone() as that is now
required.
- In ufs_close(), don't do the EAGAIN vrele hack, the top layer now calls
vn_start_write before the lock is acquired as it should.

Sponsored by: Isilon Systems, Inc.


143499 13-Mar-2005 jeff

- Don't drop the lock in ufs_inactive().
- Also in ufs_inactive, don't acquire the vnode interlock where it isn't
strictly needed. Also owning the vnode interlock while calling vprint()
will cause locking assertions to trip.

Sponsored by: Isilon Systems, Inc.


142879 01-Mar-2005 jeff

- Fix another dyslexic moment; an atomic_set_int should've become ACTIVESET,
not ACTIVECLEAR.

Submitted by: iedowse


142692 27-Feb-2005 phk

Remove debug printout of major/minor numbers, print name instead.


142682 27-Feb-2005 sam

use uiomove return value instead of always returning 0 when doing a
readlink of a fast link

Noticed by: Coverity Prevent analysis tool
Reviewed by: phk


142263 22-Feb-2005 jeff

- Add VOP locking asserts in several functions that have been implicated in
recent deadlocks.


142123 20-Feb-2005 delphij

The recomputation of file system summary at mount time can be a
very slow process, especially for large file systems that have just
been recovered from a crash.

Since the summary is already re-sync'ed every 30 seconds, we will
not lag behind too much after a crash. With this consideration
in mind, it is more reasonable to transfer the responsibility to
background fsck, to reduce the delay after a crash.

Add a new sysctl variable, vfs.ffs.compute_summary_at_mount, to
control this behavior. When set to nonzero, we will get the
"old" behavior, that the summary is computed immediately at mount
time.

Add five new sysctl variables to adjust ndir, nbfree, nifree,
nffree and numclusters respectively. Teach fsck_ffs about these
sysctls, but intentionally do not check for their existence, since
kernels without these sysctls must have recomputed the summary
and hence no adjustments are necessary.

This change has eliminated the usual tens of minutes of delay of
mounting large dirty volumes.

Reviewed by: mckusick
MFC After: 1 week


142079 19-Feb-2005 phk

Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.


142074 19-Feb-2005 delphij

When clearing a fragment, it's possible that the length is zero.

Reviewed by: mckusick
MFC After: 1 week


141927 14-Feb-2005 jeff

- Remove the unused and unsafe ufs_ihashlookup. This function returned a
vnode pointer that could not be used since no locks were held.

Sponsored by: Isilon Systems, Inc.


141685 11-Feb-2005 phk

Make non-SOFTUPDATES kernels compile again.

Integrate the stubfile into the main file now that license issues have been
long resolved.


141631 10-Feb-2005 phk

Make some SYSCTL_NODEs and some of FFS's VFS_ methods static.


141595 09-Feb-2005 jeff

- In the softupdates case for ffs_truncate() we use vinvalbuf() to
invalidate pending io and dependencies. However, vinvalbuf() rightfully
does not call vnode_pager_setsize() for us. We must do this here. This
could potentially have caused numerous kinds of bugs, but it was
specifically causing msync() deadlocks because msync() was flushing
pages that should not have been valid.

Sponsored by: Isilon Systems, Inc.
Reported by: kkenn


141570 09-Feb-2005 phk

style polishing.


141543 08-Feb-2005 cperciva

Add a new sysctl, "security.jail.chflags_allowed", which controls the
behaviour of chflags within a jail. If set to 0 (the default), then a
jailed root user is treated as an unprivileged user; if set to 1, then
a jailed root user is treated the same as an unjailed root user.

This is necessary to allow "make installworld" to work inside a jail,
since it attempts to manipulate the system immutable flag on certain
files.

Discussed with: csjp, rwatson
MFC after: 2 weeks
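
The policy described above can be modelled as a small predicate. This
is a sketch under stated assumptions: struct cred_model, the variable
jail_chflags_allowed and chflags_privileged() are hypothetical names
for illustration, not the kernel's actual interfaces.

```c
#include <stdbool.h>

/* Hypothetical model of a credential; not the kernel's struct ucred. */
struct cred_model {
    bool is_root;  /* effective uid is 0 */
    bool jailed;   /* process runs inside a jail */
};

/* Stands in for the sysctl described above: 0 is the default. */
static int jail_chflags_allowed = 0;

/* True when the credential is treated as privileged for chflags. */
static bool chflags_privileged(const struct cred_model *cr)
{
    if (!cr->is_root)
        return false;
    /* A jailed root is unprivileged unless the knob is set to 1. */
    if (cr->jailed && jail_chflags_allowed == 0)
        return false;
    return true;
}
```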


141542 08-Feb-2005 phk

Split the vop_vector for ffs1 and ffs2, this is mostly for the different
EXTATTR support.


141541 08-Feb-2005 phk

Use ffs_truncate() directly instead of UFS_TRUNCATE()


141539 08-Feb-2005 phk

Background writes are entirely an FFS/Softupdates thing.

Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.

Remove all the background write related stuff from the normal bufwrite.

This drags the softdep_move_dependencies() back into FFS.

Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.


141533 08-Feb-2005 phk

Drag another softupdates tentacle back into FFS: Now that FFS's
vop_fsync is separate from the internal use we can do the full job
there.


141526 08-Feb-2005 phk

Don't use the UFS_* and VFS_* functions where a direct call is possible.

The UFS_ functions are for UFS to call back into VFS. The VFS functions
are external entry points into the filesystem.


141523 08-Feb-2005 rwatson

Don't use VOP_LEASE() with operations on extended attribute backing
files.

Pointed out by: phk


141522 08-Feb-2005 phk

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandoned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


141521 08-Feb-2005 phk

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandoned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


141520 08-Feb-2005 phk

Use VOP_STRATEGY_APV() instead of direct dereference, this is more
correct.


141150 02-Feb-2005 jeff

- Use a separate malloc tag for saved inode contents to help in debugging
memory modified after free errors.

Sponsored by: Isilon Systems, Inc.


141143 02-Feb-2005 kensmith

Back out previous commit, bde@ provided an example of something this
breaks.


141130 02-Feb-2005 kensmith

It was noticed that we do not change a file's access time when it gets
executed. This appears to violate most of the UNIX-ish standards.
One example quote from:

http://www.opengroup.org/onlinepubs/009695399/functions/exec.html

Upon successful completion, the exec functions shall mark for update
the st_atime field of the file. If an exec function failed but was
able to locate the process image file, whether the st_atime field is
marked for update is unspecified. Should the exec function succeed,
the process image file shall be considered to have been opened with
open().

This appears to take care of it for ufs filesystems, doing the necessary
sanity checks (read-only filesystem, etc) without violating any other
standards (setting atime for any open appears to be allowed in any standards
I could find).

Noticed by: cperciva
Reviewed by: kan, rwatson


141085 31-Jan-2005 imp

nit in /*-


140962 29-Jan-2005 peadar

Tell vnode_create_vobject() how big an object to create, rather
than having it work it out via the more expensive VOP_GETATTR

Reviewed by: phk@


140939 28-Jan-2005 phk

Make filesystems get rid of their own vnodes vnode_pager object in
VOP_RECLAIM().


140936 28-Jan-2005 phk

Remove unused argument to vrecycle()


140822 25-Jan-2005 phk

Introduce and use g_vfs_close().


140782 25-Jan-2005 phk

Don't use VOP_GETVOBJECT, use vp->v_object directly.


140778 24-Jan-2005 phk

Create a vnode object when the file is opened. Trust that we did so.


140774 24-Jan-2005 phk

Don't create vnode_pager objects for the disk device.
geom_vfs will do that.


140768 24-Jan-2005 phk

Create a vp->v_object in VFS_FHTOVP() if we want to be exportable
with NFS.

We are moving responsibility for creating the vnode_pager object into
the filesystems which own the vnode, and this is one of the places
we have to cover.

We call vnode_create_vobject() directly because we own the vnode.

If we can get the size easily, pass it as an argument to save the
call to VOP_GETATTR() in vnode_create_vobject()


140729 24-Jan-2005 phk

Polish style.


140709 24-Jan-2005 jeff

- Convert the global LK lock to a mutex.
- Expand the scope of lk to cover not only interrupt races, but also
top-half races, which includes many new uses over global top-half
only data.
- Get rid of interlocked_sleep() and use msleep or BUF_LOCK where
appropriate.
- Use the lk mutex in place of the various hand rolled semaphores.
- Stop dropping the lk lock before we panic.
- Fix getdirtybuf() callers so that they reacquire access to whatever
softdep datastructure they were inspecting in the failure/retry
case. Previously, sleeps in getdirtybuf() could leave us with
pointers to bad memory.
- Update handling of ffs to be compatible with ffs locking changes.

Sponsored By: Isilon Systems, Inc.


140708 24-Jan-2005 jeff

- Initialize and destroy the per-filesystem ufs lock where appropriate.
- Use the buffer lock on the superblock buf to serialize calls to
sbupdate.
- Set the MNTK_MPSAFE flag when QUOTA is not defined in the kernel.

Sponsored By: Isilon Systems, Inc.


140707 24-Jan-2005 jeff

- Remove GIANT_REQUIRED where giant is no longer required.

Sponsored By: Isilon Systems, Inc.


140706 24-Jan-2005 jeff

- Use the ufs lock to protect fs_active.

Sponsored By: Isilon Systems, Inc.


140705 24-Jan-2005 jeff

- Acquire the ufs lock around several ffs_alloc functions that require
it.

Sponsored By: Isilon Systems, Inc.


140704 24-Jan-2005 jeff

- Don't use atomic operations to deal with the active array, instead
it is now quite naturally protected by the ufsmount mutex.
- Use the ufs lock to protect various fields in struct fs, primarily the
cg summary needs protection to avoid allocation races. Several
functions have been slightly re-arranged to reduce the number of
lock operations.
- Adjust several functions (blkfree, freefile, etc.) to accept a
ufsmount as an argument so that we may access the ufs lock.

Sponsored By: Isilon Systems, Inc.


140703 24-Jan-2005 jeff

- Acquire the ufs lock when manipulating some fields of struct fs.
- Change arguments to various ffs functions to match their new
prototypes.

Sponsored By: Isilon Systems, Inc.


140702 24-Jan-2005 jeff

- Mark the struct fs members that require the ufsmount mutex.
- Define some macros for manipulating the fs_active bitmap.

Sponsored By: Isilon Systems, Inc.
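
A rough sketch of what bitmap macros of this kind can look like. The
names and layout below are hypothetical and are not copied from fs.h;
in the kernel such updates would be made with the ufsmount mutex held,
which this userland model leaves out.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helpers: one bit per cylinder group in an array of words. */
#define BITS_PER_WORD      (CHAR_BIT * sizeof(unsigned int))
#define CG_WORD(map, cg)   ((map)[(cg) / BITS_PER_WORD])
#define CG_BIT(cg)         (1u << ((cg) % BITS_PER_WORD))

/* Mark, unmark, and query a cylinder group in the active bitmap. */
#define CG_ACTIVE_SET(map, cg)    (CG_WORD(map, cg) |= CG_BIT(cg))
#define CG_ACTIVE_CLEAR(map, cg)  (CG_WORD(map, cg) &= ~CG_BIT(cg))
#define CG_ACTIVE_TEST(map, cg)   ((CG_WORD(map, cg) & CG_BIT(cg)) != 0)
```

Keeping set and clear as a named pair makes it harder to swap the two
by accident when converting from open-coded atomic operations.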


140701 24-Jan-2005 jeff

- Change some function parameters so that the ufsmount structure is
accessible in places where the ufs lock will be needed.

Sponsored By: Isilon Systems, Inc.


140700 24-Jan-2005 jeff

- Add a mutex to the ufsmount structure. This mutex is used to protect
any per-instance global data that is not already protected by a
buf or vnode lock. Presently, only fields in ffs's struct fs utilize
this lock.
- Sort some ufsmount members so that fields used for quotas are grouped
together. This is in anticipation of quota locking.

Sponsored By: Isilon Systems, Inc.


140306 15-Jan-2005 pjd

Fix ACLs handling for the root file system.
Without this fix, when ACLs are set via tunefs(8) on the root file system,
they are removed on boot when 'mount -a' is called, because mount(8)
called for the root file system always adds the MNT_UPDATE flag, and the
MNT_UPDATE flag isn't perfect.
Now, one cannot remove ACLs stored in superblock (configured with tunefs(8))
via 'mount -a' nor 'mount -u -o noacls <file system>', but it is still
possible to mount file system which doesn't have ACLs in superblock via
'mount -o acls <file system>' or /etc/fstab's 'acls' option.

Reported by: Lech Lorens/pl.comp.os.bsd
Discussed with: phk, rwatson
Reviewed by: rwatson
MFC after: 2 weeks


140220 14-Jan-2005 phk

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


140181 13-Jan-2005 phk

Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
directly.


140056 11-Jan-2005 phk

Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.


140051 11-Jan-2005 phk

Wrap the bufobj operations in macros: BO_STRATEGY() and BO_WRITE()


140048 11-Jan-2005 phk

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to be the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


138869 14-Dec-2004 phk

white space


138868 14-Dec-2004 phk

Implement simpler panics for VOP_{read,write} on fifos.


138814 13-Dec-2004 imp

LINT defines things which compile in code that was referring to the old
a_desc element. change this to the new a_gen.a_desc to reflect
changes to vnode_if.h generation.

Noticed by: tinderbox, phk


138744 12-Dec-2004 phk

With the introduction of UFS2 we started looking for superblocks in
four different locations on a prospective filesystem.

If we found none, we forgot to invalidate the four buffers, thus the
following sequence would fail:

(md0 = blank disk)
mount /dev/md0 /mnt
(fails, no superblocks)
newfs /dev/md0
(writes using physio which does not go through buffercache).
mount /dev/md0 /mnt
(still fails, the four cached buffers still contain no superblocks)

Found by: ru


138700 11-Dec-2004 marcel

Revert previous commit. The null-pointer function call (a dereference
on ia64) was not the result of a change in the vector operations. It
was caused by the NFS locking code using a FIFO and those bypassing
the vnode. This indirectly caused the panic. The NFS locking code has
been changed.

Requested by: phk


138634 09-Dec-2004 mckusick

Fixes a bug that caused UFS2 filesystems bigger than 2TB to
prematurely report that they were full and/or to panic the kernel
with the message ``ffs_clusteralloc: allocated out of group''.

Submitted by: Henry Whincup <henry@jot.to>
MFC after: 1 week


138557 08-Dec-2004 phk

Fix snapshot creation.


138517 07-Dec-2004 phk

Fix nfs exports (for now). The real fix is to teach mountd about
nmount.


138509 07-Dec-2004 phk

The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworthy anyway. Correct information is provided by
statfs(/).


138412 05-Dec-2004 phk

VFS_STATFS(mp, ...) is mostly called with &mp->mnt_stat, but a few cases
don't. Most of the implementations have grown weeds for this so they
copy some fields from mnt_stat if the passed argument isn't that.

Fix this the cleaner way: Always call the implementation on mnt_stat
and copy that in toto to the VFS_STATFS argument if different.
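
A minimal sketch of the wrapper pattern described above, using
hypothetical stand-in types (struct statfs_model, struct mount_model)
rather than the real kernel structures: the implementation always fills
mnt_stat, and the wrapper copies the whole structure out if the caller
passed a different buffer.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for statfs and mount. */
struct statfs_model {
    long f_blocks;
    long f_bfree;
    char f_mntonname[32];
};

struct mount_model {
    struct statfs_model mnt_stat;
    int (*statfs_impl)(struct mount_model *, struct statfs_model *);
};

/* Example implementation: only updates the counters it knows about. */
static int example_statfs(struct mount_model *mp, struct statfs_model *sbp)
{
    (void)mp;
    sbp->f_blocks = 100;
    sbp->f_bfree = 40;
    return 0;
}

/* Always run the implementation on mnt_stat, then copy it in full. */
static int statfs_wrapper(struct mount_model *mp, struct statfs_model *sbp)
{
    int error = mp->statfs_impl(mp, &mp->mnt_stat);

    if (error != 0)
        return error;
    if (sbp != &mp->mnt_stat)
        *sbp = mp->mnt_stat;
    return 0;
}
```

The implementation no longer needs to special-case which buffer it was
handed; fields it never touches (here the mount point name) still reach
the caller through the full copy.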


138411 05-Dec-2004 marcel

Fix null-pointer indirect function calls introduced in the previous
commit. In the new world order, the transitive closure on the vector
operations is not precomputed. As such, it's unsafe to actually use
any of the function pointers in an indirect function call. They can
be null, and we need to use the default vector in that case.
This is mostly a quick fix for the four function pointers that are
used explicitly. A more generic or scalable solution is likely to see
the light of day.

No pathos on: current@


138359 03-Dec-2004 phk

typo in comment.


138290 01-Dec-2004 phk

Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to believe this would ever be feasible in the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and prototype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)
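
A reduced sketch of the idea, assuming a two-method vector. Every name
below (vop_vector_model, call_open, and so on) is hypothetical; the
real struct vop_vector is generated from vnode_if.src and is much
larger. The point is that designated initializers give type checking,
and a NULL slot is resolved through the vop_default chain at call time.

```c
#include <stddef.h>

struct vnode_model;

/* Typed method signatures; a mismatched function fails to compile. */
typedef int vop_open_fn(struct vnode_model *vp);
typedef int vop_close_fn(struct vnode_model *vp);

struct vop_vector_model {
    const struct vop_vector_model *vop_default; /* fallback, or NULL */
    vop_open_fn  *vop_open;
    vop_close_fn *vop_close;
};

struct vnode_model {
    const struct vop_vector_model *v_op;
    int opens;
};

static int std_open(struct vnode_model *vp)  { vp->opens++; return 0; }
static int std_close(struct vnode_model *vp) { vp->opens--; return 0; }
static int fs_open(struct vnode_model *vp)   { vp->opens += 10; return 0; }

static const struct vop_vector_model default_ops = {
    .vop_open  = std_open,
    .vop_close = std_close,
};

static const struct vop_vector_model fs_ops = {
    .vop_default = &default_ops,
    .vop_open    = fs_open,
    /* .vop_close left NULL: found via vop_default when called */
};

/* Walk the default chain until a vector provides the method. */
static int call_open(struct vnode_model *vp)
{
    const struct vop_vector_model *v = vp->v_op;

    while (v != NULL && v->vop_open == NULL)
        v = v->vop_default;
    return (v != NULL) ? v->vop_open(vp) : -1;
}

static int call_close(struct vnode_model *vp)
{
    const struct vop_vector_model *v = vp->v_op;

    while (v != NULL && v->vop_close == NULL)
        v = v->vop_default;
    return (v != NULL) ? v->vop_close(vp) : -1;
}
```

Because the fallback is resolved at call time rather than precomputed,
calling a slot of fs_ops directly would dereference NULL for
vop_close; callers have to go through a dispatcher like the ones above.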


138270 01-Dec-2004 phk

Mechanically change prototypes for vnode operations to use the new typedefs.


138075 25-Nov-2004 phk

Use system wide no-op vfs_start function.


137846 18-Nov-2004 jeff

- Eliminate the acquisition and release of the bqlock in bremfree() by
setting the B_REMFREE flag in the buf. This is done to prevent lock order
reversals with code that must call bremfree() with a local lock held.
This also reduces overhead by removing two lock operations per buf for
fsync() and similar.
- Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock
has been acquired so that we may remove ourself from the free-list.
- Provide a bremfreef() function to immediately remove a buf from a
free-list for use only by NFS. This is done because the nfsclient code
overloads the b_freelist queue for its own async. io queue.
- Simplify the numfreebuffers accounting by removing a switch statement
that executed the same code in every possible case.
- getnewbuf() can encounter locked bufs on free-lists once Giant is removed.
Remove a panic associated with this condition and delay asserts that
inspect the buf until after it is locked.

Reviewed by: phk
Sponsored by: Isilon Systems, Inc.


137726 15-Nov-2004 phk

Make VOP_BMAP return a struct bufobj for the underlying storage device
instead of a vnode for it.

The vnode_pager does not and should not have any interest in what
the filesystem uses for backend.

(vfs_cluster doesn't use the backing store argument.)


137657 13-Nov-2004 phk

Be prepared to accept NULL mountargs as part of root-mounting.


137608 12-Nov-2004 phk

Put back the vfs_object_create() calls, they do make a difference when
my test-setup does what I want it to instead of what I ask it to.

Pointed out by: tegge


137504 10-Nov-2004 phk

fix some comments


137491 09-Nov-2004 phk

Use mount flags instead of NULL path to detect root filesystem mount.


137486 09-Nov-2004 phk

Stop pretending to have a vm_object backing the underlying disk vnode:
it isn't used for anything anywhere and the vnode_pager would explode
if we attempted to.


137308 06-Nov-2004 phk

Properly implement a default version of VOP_GETWRITEMOUNT.

Remove improper access to vop_stdgetwritemount() which should and
will instead rely on the VOP default path.


137194 04-Nov-2004 phk

Don't grab the exclusive bit on a root filesystem until we are willing
to mount it. Doing so prevented fsck from being run after a refused mount.


137035 29-Oct-2004 phk

Move UFS from DEVFS backing to GEOM backing.

This eliminates a bunch of vnode overhead (approx 1-2 % speed
improvement) and gives us more control over the access to the storage
device.

Access counts on the underlying device are now correctly tracked and
therefore it is possible to read-only mount the same disk device multiple
times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1

Since UFS/FFS is not a synchronously consistent filesystem (ie: it caches
things in RAM) this is not possible with read-write mounts, and the system
will correctly reject this.

Details:

Add a geom consumer and a bufobj pointer to ufsmount.

Eliminate the vnode argument from softdep_disk_prewrite().
Pick the vnode out of bp->b_vp for now. Eventually we
should find it through bp->b_bufobj->b_private.

In the mountcode, use g_vfs_open() once we have used
VOP_ACCESS() to check permissions.

When upgrading and downgrading between r/o and r/w do the
right thing with GEOM access counts. Remove all the
workarounds for not being able to do this with VOP_OPEN().

If we are the root mount, drop the exclusive access count
until we upgrade to r/w. This allows fsck of the root
filesystem and the MNT_RELOAD to work correctly.

Set bo_private to the GEOM consumer on the device bufobj.

Change the ffs_ops->strategy function to call g_vfs_strategy()

In ufs_strategy() directly call the strategy on the disk
bufobj. Same in rawread.

In ffs_fsync() we will no longer see VCHR device nodes, so
remove code which synced the filesystem mounted on it, in
case we came there. I'm not sure this code made sense in
the first place since we would have taken the specfs route
on such a vnode.

Redo the highly bogus readblock() function in the snapshot
code to something slightly less bogus: Constructing an uio
and using physio was really quite a detour. Instead just
fill in a bio and ship it down.


137007 28-Oct-2004 phk

We only support backing UFS/FFS with disks.


136988 27-Oct-2004 phk

Eliminate unnecessary KASSERTS.


136982 26-Oct-2004 phk

KASSERT that we only get to prewrite() on writes.


136981 26-Oct-2004 phk

White space changes. Add missing static.


136980 26-Oct-2004 phk

Replace single case switch() with if().


136979 26-Oct-2004 phk

Vertically align comment.


136969 26-Oct-2004 phk

The island council met and voted buf_prewrite() home.

Give ffs its own bufobj->bo_ops vector and create a private strategy
routine, (currently misnamed for forwards compatibility), which is
just a copy of the generic bufstrategy routine except we call
softdep_disk_prewrite() directly instead of through the buf_prewrite()
indirection.

Teach UFS about the need for softdep_disk_prewrite() and call the
function directly in FFS.

Remove buf_prewrite() from the default bufstrategy() and from the
global bio_ops method vector.


136968 26-Oct-2004 phk

Fix syntax errors introduced by last commit.

Why isn't DIRECTIO in NOTES/LINT ?


136966 26-Oct-2004 phk

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot
when filesystems sit on GEOM, so don't bother for now.


136963 26-Oct-2004 phk

Degeneralize the per cdev copyonwrite callback. The only possible value
is ffs_copyonwrite() and the only place it can be called from is FFS which
would never want to call another filesystem's copyonwrite method, should one
exist, so there is no reason why anything generic should know about this.


136943 25-Oct-2004 phk

Lose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


136941 25-Oct-2004 phk

Remove vnode->v_bsize. This was a dead-end.


136927 24-Oct-2004 phk

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which call through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


136767 22-Oct-2004 phk

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


136751 21-Oct-2004 phk

Move the VI_BWAIT flag into a bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use of interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


136721 20-Oct-2004 rwatson

Explicitly break out NETA license from Berkeley license to clearly
indicate license grant, as well as to indicate that NETA is asserting
only two clauses, not four clauses.

Requested by: imp


136336 09-Oct-2004 njl

Fix fsbtodb() for UFS1. This fixes an overflow for file sizes >1 TB,
allowing for sizes up to 4 TB. This doesn't affect UFS2 since b is already
a 64 bit type, coincidental with daddr_t.

Submitted by: bde
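
A small illustration of this class of overflow, as a sketch only: the
function names are hypothetical, and unsigned 32-bit arithmetic is used
here so the truncation is well defined in C (the original type was a
signed 32-bit block number). Widening before the shift keeps the high
bits.

```c
#include <stdint.h>

/* Shift performed at 32-bit width: bits shifted past bit 31 are lost. */
static uint32_t fsbtodb_narrow(uint32_t frag, unsigned shift)
{
    return frag << shift;
}

/* Widen to 64 bits first, then shift: the full disk address survives. */
static int64_t fsbtodb_wide(uint32_t frag, unsigned shift)
{
    return (int64_t)frag << shift;
}
```

For example, with a shift of 2 (2048-byte fragments on 512-byte
sectors, an assumed geometry), fragment 0x50000000 maps to sector
0x140000000 in the widened form, while the narrow form wraps to
0x40000000.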


136144 05-Oct-2004 pjd

Back out changes which were introduced to delay mounting root file system.
Those changes were made for gmirror's needs, but now gmirror handles this
by itself.


135877 28-Sep-2004 phk

Remove support for accessing device nodes in UFS/FFS.

Device nodes can still be created and exported with NFS.


135858 27-Sep-2004 phk

Give cluster_write() an explicit vnode argument.

In the future a struct buf will not automatically point out a vnode for us.


135612 23-Sep-2004 pjd

Introduce new /boot/loader.conf variable: root_mount_delay.
It can be used to delay mounting the root partition to give GEOM
providers a chance to show up.
Now, when the needed provider is missing, the vfs_rootmount() function will
look for it every second and, if it can't be found within the defined time,
it'll ask for the root device name (before this change this was done
immediately).

This allows booting from a gmirror device in degraded mode.


135459 19-Sep-2004 phk

The getpages VOP was a good stab at getting scatter/gather I/O without
too much kernel copying, but it is not the right way to do it, and it is
in the way for straightening out the buffer cache.

The right way is to pass the VM page array down through the struct
bio to the disk device driver and DMA directly into/out of the
physical memory. Once the VM/buf thing is sorted out it is next on
the list.

Retire most of the vnode method ffs_getpages(). It is not clear if what is
left shouldn't be in the default implementation which we now fall back to.

Retire specfs_getpages() as well, as it has no users now.


135312 16-Sep-2004 phk

Do not traverse list of snapshots if there isn't one.

Found by: scottl


135303 16-Sep-2004 phk

Missed a place where snapshots were allocated in my last commit to
this file.


135138 13-Sep-2004 phk

Create struct snapdata which contains the snapshot fields from cdev
and the previously malloc'ed snapshot lock.

Malloc struct snapdata instead of just the lock.

Replace snapshot fields in cdev with pointer to snapdata (saves 16 bytes).

While here, give the private readblock() function a vnode argument
in preparation for moving UFS to access GEOM directly.


135135 13-Sep-2004 phk

Remove the buffercache/vnode side of BIO_DELETE processing in
preparation for integration of p4::phk_bufwork. In the future,
local filesystems will talk to GEOM directly and they will consequently
be able to issue BIO_DELETE directly. Since the removal of the fla
driver, BIO_DELETE has effectively been a no-op anyway.


134899 07-Sep-2004 phk

Create simple function init_va_filerev() for initializing a va_filerev
field.

Replace three instances of longhaired initialization of va_filerev fields.

Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.


134143 22-Aug-2004 csjp

Currently, if the secure level is low enough, system flags can
be manipulated by prison root. In 4.x prison root can not manipulate
system flags, regardless of the security level. This behavior
should remain consistent to avoid any surprises which could lead
to security problems for system administrators which give out
privileged access to jails.

This commit changes suser_cred's flag argument from SUSER_ALLOWJAIL
to 0. This will prevent prison root from being able to manipulate
system flags on files.

This may be a MFC candidate for RELENG_5.

Discussed with: cperciva
Reviewed by: rwatson
Approved by: bmilekic (mentor)
PR: kern/70298


134011 19-Aug-2004 jhb

Generalize the UFS bad magic value used to determine when a filesystem
has only been partly initialized via newfs(8) so that it applies to both
UFS1 and UFS2.

Submitted by: "Xin LI" delphij at frontfree dot net
MFC: maybe?


133837 16-Aug-2004 dwmalone

When looking for some extra data to include in the hash, use the
address of the dirhash, rather than the first sizeof(struct dirhash
*) bytes of the structure (which, thankfully, seem to be constant).

Submitted by: Ted Unangst <tedu@zeitbombe.org>
MFC after: 2 weeks
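
The difference between hashing the structure's leading bytes and hashing its address can be sketched in userland as follows. The structure layout and function names are hypothetical, and the FNV-1 routine is a local reimplementation, not the kernel's hashing code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1 over a byte buffer, continuing from hval. */
static uint32_t
fnv1_32(const void *buf, size_t len, uint32_t hval)
{
    const unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    while (len-- > 0) {
        hval *= 16777619u;      /* FNV 32-bit prime */
        hval ^= *p++;
    }
    return (hval);
}

/* Hypothetical layout: the leading fields are the same in every object. */
struct dirhash_sketch {
    long lock_word;
    int  nblk;
};

/* Before: mixes in the first pointer-sized bytes of the structure. */
static uint32_t
hash_with_contents(const struct dirhash_sketch *dh, uint32_t hval)
{
    return (fnv1_32(dh, sizeof(dh), hval));
}

/* After: mixes in the pointer value itself, which differs per object. */
static uint32_t
hash_with_address(const struct dirhash_sketch *dh, uint32_t hval)
{
    return (fnv1_32(&dh, sizeof(dh), hval));
}
```

Two objects whose leading bytes match contribute nothing distinguishing under the first form; the second form feeds in a value that is unique to each live object.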


133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowledge of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using its filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
acquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


133327 08-Aug-2004 phk

use bufdone() not biodone().


132902 30-Jul-2004 phk

Put a version element in the VFS filesystem configuration structure
and refuse initializing filesystems with a wrong version. This will
aid maintenance activities on the 5-stable branch.

s/vfs_mount/vfs_omount/

s/vfs_nmount/vfs_mount/

Name our filesystems mount function consistently.

Eliminate the nameidata argument to both vfs_mount and vfs_omount.
It was originally there to save stack space. A few places abused
it to get hold of some credentials to pass around. Effectively
it is unused.

Reorganize the root filesystem selection code.


132805 28-Jul-2004 phk

Remove global variable rootdevs and rootvp, they are unused as such.

Add local rootvp variables as needed.

Remove checks for miniroot's in the swappartition. We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.


132775 28-Jul-2004 kan

Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper
inode field based on UFS version. Use DIP to read values and DIP_SET
to modify them throughout FFS code base.
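
A minimal sketch of the read/write accessor split described in this entry, assuming a simplified inode with a hypothetical version selector. The real structures and macro bodies in the tree differ; only the idea of replacing an assignment through a cast with a dedicated setter is taken from the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical on-disk inode layouts. */
struct din1 { int32_t di_size; };
struct din2 { int64_t di_size; };

struct inode {
    int          i_is_ufs1;     /* hypothetical version selector */
    struct din1 *i_din1;
    struct din2 *i_din2;
};

/* Read accessor: an rvalue expression, so it cannot be assigned to. */
#define DIP(ip, field)                                          \
    ((ip)->i_is_ufs1 ? (int64_t)(ip)->i_din1->d##field :        \
        (ip)->i_din2->d##field)

/* Write accessor: assigns to the real member, no cast on the left. */
#define DIP_SET(ip, field, val) do {                            \
    if ((ip)->i_is_ufs1)                                        \
        (ip)->i_din1->d##field = (val);                         \
    else                                                        \
        (ip)->i_din2->d##field = (val);                         \
} while (0)

/* Example caller: grow the size field through the accessors only. */
static int64_t
inode_grow(struct inode *ip, int64_t delta)
{
    assert(ip != NULL);
    DIP_SET(ip, i_size, DIP(ip, i_size) + delta);
    return (DIP(ip, i_size));
}
```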


132653 26-Jul-2004 cperciva

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


132154 14-Jul-2004 phk

Make sure to update the mnt_stats before UFS1 extattr tried to
do I/O on the device. Otherwise the blocksize is undefined in the
buffer cache.


132023 12-Jul-2004 alfred

Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.


131907 10-Jul-2004 marcel

Update for the KDB debugger framework:
o Make debugging code conditional upon KDB.
o Use kdb_backtrace() instead of backtrace().
o Remove inclusion of opt_ddb.h.


131756 07-Jul-2004 phk

Explicitly initialize vp->v_bsize.


131551 04-Jul-2004 phk

When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously gets to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
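
The ordering that the fix establishes can be modeled in a single-threaded userland sketch. All names below are hypothetical stand-ins, and an integer flag takes the place of the mountpoint mutex; the point shown is only that v_mount is cleared before the lock is dropped, so a traversal that holds the lock can trust its check.

```c
#include <assert.h>
#include <stddef.h>

struct mount;

struct vnode {
    struct mount *v_mount;
    struct vnode *v_next;
};

struct mount {
    int           mnt_locked;   /* stand-in for the mountpoint mutex */
    struct vnode *mnt_head;
};

static void
mnt_ilock(struct mount *mp)
{
    assert(!mp->mnt_locked);
    mp->mnt_locked = 1;
}

static void
mnt_iunlock(struct mount *mp)
{
    assert(mp->mnt_locked);
    mp->mnt_locked = 0;
}

/* Unlink vp and clear v_mount before the lock is dropped. */
static void
delmntque_sketch(struct vnode *vp)
{
    struct mount *mp = vp->v_mount;
    struct vnode **pp;

    mnt_ilock(mp);
    for (pp = &mp->mnt_head; *pp != NULL; pp = &(*pp)->v_next) {
        if (*pp == vp) {
            *pp = vp->v_next;
            break;
        }
    }
    vp->v_next = NULL;
    vp->v_mount = NULL;         /* done while the lock is still held */
    mnt_iunlock(mp);
}

/* The check a traversal makes on its cached next vnode, lock held. */
static int
still_on_mount(const struct mount *mp, const struct vnode *vp)
{
    assert(mp->mnt_locked);
    return (vp->v_mount == mp);
}
```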


131072 24-Jun-2004 rwatson

Annotate that we don't check the returned data length from ufs_readdir()
because UFS uses fixed-size directory blocks. When using this code with
other file systems, such as HFS+, the value of auio.uio_resid will need
to be taken into account.


131069 24-Jun-2004 rwatson

Remove unnecessary setting of VV_SYSTEM on extended attribute backing
files. When this flag is used in our port of this code to Darwin, it
caused remarkable pain, and doesn't offer a benefit in FreeBSD.


131067 24-Jun-2004 rwatson

Protect a non-text comment with a '-'.


131066 24-Jun-2004 rwatson

White space cleanup: use spaces instead of tabs in variable declarations
local to a function. Remove a couple of blank lines in variable
declarations.

In one case, explicitly test against NULL rather than using a pointer
as a boolean directly.


130761 20-Jun-2004 bde

Backed out previous commit. The dev_t -> `struct cdev *' changes have
lots of errors. Blind substitution of "dev_t foo" by "struct cdev *foo"
in comments usually just created an English syntax error (e.g.,
"struct cdev *changes"), but here it did less than that since the dev_t
is a user dev_t.


130690 18-Jun-2004 kuriyama

Avoid deadlock which is caused by locking VDIR of parent and VREG of
snapshot itself in wrong order.
We can skip unlink check of that directory because it must have
snapshot in it.

Reviewed by: mckusick and current@


130585 16-Jun-2004 phk

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


130551 16-Jun-2004 julian

Nice is a property of a process as a whole.
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


130246 08-Jun-2004 stefanf

Avoid assignments to cast expressions.

Reviewed by: md5
Approved by: das (mentor)


130023 03-Jun-2004 tjr

Move TDF_DEADLKTREAT into td_pflags (and rename it accordingly) to avoid
having to acquire sched_lock when manipulating it in lockmgr(), uiomove(),
and uiomove_fromphys().

Reviewed by: jhb


129895 31-May-2004 krion

- Fix typo

Approved by: tobez


129545 21-May-2004 kensmith

Upon further review it was decided this piece of the msync(2)
fixes was applicable to HEAD, originally it was thought this
should only be done in RELENG_4. Implement IO_INVAL in the vnode
op for writing by marking the buffer as "no cache". This fix
has already been applied to RELENG_4 as Rev. 1.65.2.15 of
ufs/ufs/ufs_readwrite.c.

Reviewed by: alc, tegge


129450 19-May-2004 kensmith

Style fixup in previous commit.

Noticed by: bde (thanks!)


129244 14-May-2004 kensmith

Change ffs_realloccg() to set the valid bits for the extended part of the
fragment to zero the valid parts of a VM_IO buffer.

RE would like this to be part of 4.10-RC3 so this will be MFC-ed immediately.

Reviewed by: alc, tegge


128740 29-Apr-2004 bmilekic

Revert previous change to this file because it breaks some
things which compare /etc/fstab entries to results from
getfsstat(). The real way to fix this is to make 'ufs2'
a recognized filesystem (for real, no beating around the
bush).

This should fix things like 'umount -a -t ufs' now.
Apologies for the previous breakage.


128658 26-Apr-2004 bmilekic

The previous change to mount(8) to report ufs or ufs2 used
libufs, which only works for Charlie root.

This change reverts the introduction of libufs and moves the
check into the kernel. Since the f_fstypename is the same
for both ufs and ufs2, we check fs_magic for presence of
ufs2 and copy "ufs2" explicitly instead.

Submitted by: Christian S.J. Peron <maneo@bsdpro.com>


128006 07-Apr-2004 bde

Record where half the bits in this file came from (from ufs_readwrite.c).
Damage to history from moving bits was especially large since a repo copy
is not feasible for partial files.


127975 07-Apr-2004 imp

Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.

Approved by: core, rwatson


127955 06-Apr-2004 jhb

Fix a paste-o from the buf_prewrite() cleanup commit and check for the
MNTK_SUSPEND flag on the correct vnode pointer in softdep_disk_prewrite().

Reviewed by: phk
Tested by: kensmith


127818 03-Apr-2004 mux

Fix the remaining warnings of growfs(8) on my sparc64 box with
WARNS=6. I don't change the WARNS level in the Makefile because I
didn't test this on other archs.

The fs.h fix was suggested by: marcel
Reviewed by: md5(1)


127095 16-Mar-2004 kan

Avoid doing bawrite to initialize inode block while holding cylinder
group block locked. If filesystem has any active snapshots, bawrite
can come back trying to allocate new snapshot data block from the same
cylinder group and cause panic due to recursive lock attempt.

PR: 64206
Reviewed by: mckusick
Tested by: pjd


126858 11-Mar-2004 phk

When I was a kid my work table was one cluttered mess and cleaning it up
was a rather overwhelming task. I soon learned that if you don't know
where you're going to store something, at least try to pile it next to
something slightly related in the hope that a pattern emerges.

Apply the same principle to the ffs/snapshot/softupdates code which has
leaked into specfs: Add yet a buf-quasi-method and call it from the
only two places I can see it can make a difference and implement the
magic in ffs_softdep.c where it belongs.

It's not pretty, but at least it's one less layer violated.


126853 11-Mar-2004 phk

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


126170 23-Feb-2004 mckusick

A more accurate test in the new ufs_lock than that in 1.235.


126154 23-Feb-2004 mckusick

In the function clear_inodedeps(), a FREE_LOCK() should be called
AFTER the call to vn_start_write(), not before it. Otherwise, it is
possible to unlock it multiple times if the vn_start_write() fails.

Submitted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>


126153 23-Feb-2004 mckusick

Change UFS from using vop_stdlock to using its own ufs_lock.
In ufs_lock, check for attempts to acquire shared locks on
snapshot files and change them to be exclusive locks. This
change eliminates deadlocks and machine lockups reported in
-current since most read requests started using shared lock
requests.

Submitted by: Jun Kuriyama <kuriyama@imgsrc.co.jp>


126097 22-Feb-2004 rwatson

Update my personal copyrights and NETA copyrights in the kernel
to use the "year1-year3" format, as opposed to "year1, year2, year3".
This seems to make lawyers more happy, but also prevents the
lines from getting excessively long as the years start to add up.

Suggested by: imp


125854 15-Feb-2004 dwmalone

Abstract dirhash's locking using macros. This should make it easier to
use the same dirhash code on different branches/platforms.

Reviewed by: Ted Unangst <tedu@zeitbombe.org>
Reviewed by: iedowse
MFC after: 3 weeks


125796 14-Feb-2004 bde

Fixed some style bugs:
- don't unlock the vnode after vinvalbuf() only to have to relock it
almost immediately.
- don't refer to devices classified by vn_isdisk() as block devices.


125765 13-Feb-2004 bde

MFextfs: backed out secondary changes in rev.1.40 that had become just
style bugs (a variable that is used only once, and misformattings).


125764 13-Feb-2004 kuriyama

Fix style bugs in previous commit.

Submitted by: bde


125738 12-Feb-2004 bde

Fixed some minor style bugs (English usage and formatting of binary
operators) in and near revs.1.169-1.170 (open mode bandaid). This
(or better a proper fix) should have been done before cloning the
bandaid to many other file systems.


125732 12-Feb-2004 kuriyama

Reverse lock order by using local variable. This will shut up "acquiring
duplicate lock of same type" message.

Reviewed by: mckusick


125710 11-Feb-2004 bde

Removed more vestiges of vfs_ioopt:
- rev.1.42 of ffs_readwrite.c added a special case in ffs_read() for reads
that are initially at EOF, and rev.1.62 of ufs_readwrite.c fixed
timestamp bugs in it. Removal of most of vfs_ioopt made it just an
optimization, and removal of the vm object reference calls made it less
than an optimization. It was cloned in rev.1.94 of ufs_readwrite.c as
part of cloning ffs_extwrite() although it was always less than an
optimization in ffs_extwrite().
- some comments, compound statements and vertical whitespace were vestiges
of dead code.


125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that it is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't use the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


125259 31-Jan-2004 alc

Remove unnecessary vm object reference and deallocate calls from ffs_read()
and ffs_write(). These calls trace their origins to the dead vfs_ioopt
code, first appearing in revision 1.39 of ufs_readwrite.c.

Observed by: bde
Discussed with: tegge


125079 27-Jan-2004 ache

Turn uio_resid/uio_offset comments into KASSERTs

Reviewed by: bde


124857 23-Jan-2004 ache

Copy comment about caller check from ffs_read to ffs_extread, don't
check for uio_resid < 0 here too.


124856 23-Jan-2004 ache

Fix various panic() strings to reflect true function name to allow
easy grep.
Small code reorganization to look more logical.
Copy ffs_write check from prev. commit to ffs_extwrite.


124855 23-Jan-2004 ache

ffs_read:
Replace the wrong check that returned EFBIG with EOVERFLOW handling from POSIX:

36708 [EOVERFLOW] The file is a regular file, nbyte is greater than 0, the
starting position is before the end-of-file, and the starting position is
greater than or equal to the offset maximum established in the open file
description associated with fildes.

ffs_write:
Replace u_int64_t cast with uoff_t cast which is more natural for types
used.

ffs_write & ffs_read:
Remove uio_offset and uio_resid checks for negative values, the caller
is supposed to do it already. Add comments about it.

Reviewed by: bde
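
The POSIX condition quoted in this entry can be written as a small predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the regular-file requirement is left to the caller, and negative values are assumed to have been rejected earlier, as the log says.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return EOVERFLOW when a non-empty read starts before end-of-file
 * but at or beyond the offset maximum; return 0 otherwise.
 */
static int
read_offset_check(int64_t offset, int64_t nbyte, int64_t filesize,
    int64_t maxoffset)
{
    assert(offset >= 0 && nbyte >= 0);  /* caller checks, per the log */
    if (nbyte > 0 && offset < filesize && offset >= maxoffset)
        return (EOVERFLOW);
    return (0);
}
```

A read that starts at or past end-of-file is not an error under this rule; it simply returns no data.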


124728 19-Jan-2004 kan

Spell magic '16' number as IO_SEQSHIFT.
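
A minimal sketch of why a named shift reads better than a bare 16, assuming the sequential-access hint is carried in the bits above the low 16 flag bits of an I/O flag word. The macro and function names here carry a _SKETCH suffix or are hypothetical; only the value 16 comes from the log.

```c
#include <assert.h>

#define IO_SEQSHIFT_SKETCH  16      /* the '16' named in the log */
#define IO_FLAGMASK_SKETCH  0xffff  /* assumption: low bits hold flags */
#define IO_SEQMAX_SKETCH    0x7f    /* assumption: upper bound on the hint */

/* Pack a sequential-access hint above the flag bits. */
static int
ioflag_pack(int flags, int seqcount)
{
    assert(flags >= 0 && flags <= IO_FLAGMASK_SKETCH);
    assert(seqcount >= 0 && seqcount <= IO_SEQMAX_SKETCH);
    return ((seqcount << IO_SEQSHIFT_SKETCH) | flags);
}

/* Recover the hint using the named shift. */
static int
ioflag_seqcount(int ioflag)
{
    assert(ioflag >= 0);
    return (ioflag >> IO_SEQSHIFT_SKETCH);
}
```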


124119 04-Jan-2004 kan

Avoid calling vprint on a vnode while holding its interlock mutex.
Move diagnostic printf after vget. This might delay the debug
output some, but at least it keeps kernel from exploding if
DEBUG_VFS_LOCKS is in effect.


123217 07-Dec-2003 truckman

Set fs_ronly to the correct value in ffs_reload() when reloading the file
system super block after fsck has repaired the file system. The value of
fs_ronly was getting overwritten, which caused ffs_update() to attempt to
update inode timestamps even though the file system was still mounted
read-only.

This fixes the "giving up on N buffers" error that is triggered by running
fsck on the root file system and then rebooting without mounting the file
system read-write.


122783 16-Nov-2003 wes

Write the UFS2 superblock with a 'BAD' magic number at the beginning
of newfs, to signify the newfs operation has not yet completed. Re-
write the superblock with the correct magic number once all of the
cylinder groups have been created to show the operation has finished.

Sponsored by: St. Bernard Software
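
The two-phase write described in this entry can be sketched with an in-memory superblock. The magic values and all names below are hypothetical placeholders, not the constants used by newfs or the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define MAGIC_BAD_SKETCH    0x0badf00du     /* hypothetical values */
#define MAGIC_GOOD_SKETCH   0x600dcafeu

struct sb_sketch {
    uint32_t magic;
    int      ncg_done;      /* cylinder groups written so far */
};

/* Phase 1: mark the file system as under construction. */
static void
newfs_begin(struct sb_sketch *sb)
{
    sb->magic = MAGIC_BAD_SKETCH;
    sb->ncg_done = 0;
}

static void
newfs_write_cg(struct sb_sketch *sb)
{
    assert(sb->magic == MAGIC_BAD_SKETCH);
    sb->ncg_done++;
}

/* Phase 2: only after every group exists, publish the real magic. */
static void
newfs_finish(struct sb_sketch *sb, int ncg)
{
    assert(sb->ncg_done == ncg);
    sb->magic = MAGIC_GOOD_SKETCH;
}

/* A mount attempt would refuse anything without the real magic. */
static int
mountable(const struct sb_sketch *sb)
{
    return (sb->magic == MAGIC_GOOD_SKETCH);
}
```

An interrupted run leaves the placeholder magic on disk, so a later mount attempt can tell that initialization never finished.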


122747 15-Nov-2003 phk

Send B_PHYS out to pasture, it no longer serves any function.


122596 13-Nov-2003 alc

Call free(9) after the vnode interlock is released, avoiding a lock-order
reversal.


122537 12-Nov-2003 mckusick

Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Tim Robbins <tjr@freebsd.org>
Reviewed by: Julian Elischer <julian@elischer.org>
Reviewed by: the hordes of <arch@freebsd.org>
Sponsored by: DARPA & NAI Labs.


122091 05-Nov-2003 kan

Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff


121925 03-Nov-2003 kan

Use VOP_UNLOCK/vrele instead of vput. td was received as a parameter
and one cannot be sure it is equal to curthread.


121874 02-Nov-2003 kan

Take care not to call vput if thread used in corresponding vget
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.

The only case where td != curthread known at the moment is
boot() calling sync with thread0 pointer.

This fixes the panic on shutdown people have reported.


121847 01-Nov-2003 kan

Temporarily undo parts of the struct mount locking commit by jeff.
It is unsafe to hold a mutex across vput/vrele calls.

This will be redone when a better locking strategy is agreed upon.

Discussed with: jeff


121785 31-Oct-2003 truckman

Tweak the calculation of minbfree in ffs_dirpref() so that only
those cylinder groups that have at least 75% of the average free
space per cylinder group for that file system are considered as
candidates for the creation of a new directory. The previous formula
for minbfree would set it to zero if the file system was more than
75% full, which allowed cylinder groups with no free space at all
to be chosen as candidates for directory creation, which resulted
in an expensive search for free blocks for each file that was
subsequently created in that directory.

Modify the calculation of minifree in the same way.

Decrease maxcontigdirs as the file system fills to decrease the
likelihood that a cluster of directories will overflow the available
space in a cylinder group.

Reviewed by: mckusick
Tested by: kmarx@vicor.com
MFC after: 2 weeks
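
The two threshold formulas can be compared with a short sketch reconstructed from the description in this entry. The function names, the per-group total parameter, and the lower bound of 1 in the new form are assumptions; the committed code may differ in detail.

```c
#include <assert.h>

/*
 * Previous threshold as described: average free space minus a quarter
 * of the per-group total, which reaches zero once the file system is
 * more than 75% full.
 */
static long
minfree_old(long avgfree, long per_group_total)
{
    long m;

    assert(avgfree >= 0 && per_group_total >= 0);
    m = avgfree - per_group_total / 4;
    return (m < 0 ? 0 : m);
}

/*
 * New threshold: 75% of the average free space.  The floor of 1 is an
 * assumption that keeps completely full groups out of consideration.
 */
static long
minfree_new(long avgfree)
{
    long m;

    assert(avgfree >= 0);
    m = avgfree - avgfree / 4;
    return (m < 1 ? 1 : m);
}
```

For a group size of 1000 blocks with an average of 200 free, the old form gives a threshold of 0 and so accepts empty groups, while the new form requires at least 150 free blocks.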


121443 23-Oct-2003 jhb

Move the P_COWINPROGRESS flag from being a per-process p_flag to being a
per-thread td_pflag which doesn't require any locks to read or write as it
is only read or written by curthread on itself.

Glanced at by: mckusick


121354 22-Oct-2003 tegge

Initialize bp->b_offset to the physical offset in partition
so GEOM knows where to read from disk.


121205 18-Oct-2003 phk

DuH!

bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in
the file)


121202 18-Oct-2003 phk

Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY()


121158 17-Oct-2003 mckusick

When expunging unlinked files from a snapshot, skip over holes in the
file rather than panicking with "indiracct: botched params".

Submitted by: Mark Santcroos <marks@ripe.net>


120841 06-Oct-2003 jeff

- My last commit to this file is still not safe, I believe that it may be
due to the recursion in indir_trunc().


120839 06-Oct-2003 jeff

- Reinstate 1.142; this was fixed by 1.144.


120825 05-Oct-2003 jeff

- The VCHR case in ffs_sync() is an unnecessary optimization especially
considering how infrequently we access devices via ffs now that we have
devfs. Collapse this case with the other case.

Obtained from: bde


120805 05-Oct-2003 jeff

- Further simplify ffs_sync(). The vnode lock is required for UFS_UPDATE()
so make the code slightly more uniform. The vnode lock is acquired in
all cases and now the only difference between VCHR and other is we
call UFS_UPDATE instead of VOP_FSYNC().


120804 05-Oct-2003 jeff

- In ffs_update() assert that either the vnode lock or the XLOCK is held.


120793 05-Oct-2003 jeff

- Check the XLOCK before inspecting v_data.
- Slightly rewrite the fsync loop to be more lock friendly. We must
acquire the vnode interlock before dropping the mnt lock. We must
also check XLOCK to prevent vclean() races.
- Use LK_INTERLOCK in the vget() in ffs_sync to further prevent vclean()
races.
- Use a local variable to store the results of the nvp == TAILQ_NEXT
test so that we do not access the vp after we've vrele()d it.
- Add an XXX comment about UFS_UPDATE() not being protected by any lock
here. I suspect that it should need the VOP lock.


120789 05-Oct-2003 jeff

- Skip over xvp if XLOCK is set.


120777 05-Oct-2003 jeff

- Don't cache_purge() in ufs_reclaim. vclean() does it for us so
this is redundant.


120763 04-Oct-2003 alc

Synchronize access to a vm page's valid field using the containing
vm object's lock.


120750 04-Oct-2003 jeff

- The VI assert in getdirtybuf() is only valid if we're not on a VCHR
vnode. VCHR vnodes don't do background writes.

Reported by: kan


120741 04-Oct-2003 jeff

- Increase the scope of the interlock in ffs_reload(). Acquire it before
we release the mntvnode_mtx.
- Call vgonel() directly instead of going through vrecycle() since we own
the interlock now.
- Remove a few cases where we locked the interlock just so that we could
call VOP_UNLOCK with interlock held.


120740 04-Oct-2003 jeff

- Fix an unlocked call to GETATTR by slightly shuffling the code in
ffs_snapshot() around.
- Acquire the interlock before releasing the mntvnode_mtx. Use the
interlock to protect v_usecount access.


120738 04-Oct-2003 jeff

- Use the VI_LOCK macro in two places where we directly called mtx_lock()
before. Direct calls indicated places that needed review and these have
now been reviewed.


120737 04-Oct-2003 jeff

- Properly acquire the vnode interlock before releasing the
mntvnode_mtx.
- Use a local variable to store the results of the test to see if the
next vnode on the mount list has changed. This is so that we no longer
access the vnode after we vput() it.


120732 04-Oct-2003 jeff

- Remove a mp_fixme() and some locks that weren't necessary. I now
understand how this works.


119707 03-Sep-2003 jeff

- Several of the callers to getdirtybuf() were erroneously changed to pass
in a list head instead of a pointer to the first element at the time of
the first call. These lists are subject to change, and getdirtybuf()
would refetch from the wrong list in some cases.

Spotted by: tegge
Pointy hat to: me


119604 31-Aug-2003 jeff

- Backout rev 1.142. This caused a deadlock that I do not understand. More
investigation is required.


119603 31-Aug-2003 jeff

- Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to
bail out if the buffer is not already present.
- The buffer returned by incore() is not locked and should not be sent to
brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the
desired semantics.
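
The lookup-without-create behaviour described in this entry can be sketched with a tiny fixed-size cache. Everything here is a hypothetical userland model: the flag value, structure, and function are not the kernel's getblk() interface, and the sketch neither evicts nor blocks on an already locked buffer.

```c
#include <assert.h>
#include <stddef.h>

#define GB_NOCREAT_SKETCH   0x1     /* hypothetical flag value */
#define NBUF_SKETCH         4

struct buf_sketch {
    int  in_use;
    long blkno;
    int  locked;
};

static struct buf_sketch cache[NBUF_SKETCH];

/*
 * Return the locked buffer for blkno.  Without the flag a missing
 * buffer is created; with it, a miss returns NULL instead.
 */
static struct buf_sketch *
getblk_sketch(long blkno, int flags)
{
    struct buf_sketch *freebp = NULL;
    int i;

    for (i = 0; i < NBUF_SKETCH; i++) {
        if (cache[i].in_use && cache[i].blkno == blkno) {
            cache[i].locked = 1;
            return (&cache[i]);
        }
        if (!cache[i].in_use && freebp == NULL)
            freebp = &cache[i];
    }
    if ((flags & GB_NOCREAT_SKETCH) != 0)
        return (NULL);
    assert(freebp != NULL);     /* the sketch does not evict */
    freebp->in_use = 1;
    freebp->blkno = blkno;
    freebp->locked = 1;
    return (freebp);
}
```

Unlike a bare presence test, the caller gets back a buffer that is already locked, which is what makes it safe to release afterwards.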


119601 31-Aug-2003 jeff

- Don't acquire the vnode interlock in drain_output(). Instead, require the
caller to acquire it. This permits drain_output() to be done atomically
with other operations as well as reducing the number of lock operations.
- Assert that the proper locks are held in drain_output().
- Change getdirtybuf() to accept a mutex as an argument. This mutex is used
to protect the vnode's buf list and the BKGRDWAIT flag. This lock is
dropped when we successfully acquire a buffer and held on return
otherwise. These semantics reduce the number of cumbersome cases in
calling code.
- Pass the mtx from getdirtybuf() into interlocked_sleep() and allow this
mutex to be used as the interlock argument to BUF_LOCK() in the LOCKBUF
case of interlocked_sleep().
- Change the return value of getdirtybuf() to be the resulting locked buffer
or NULL otherwise. This is for callers who pass in a list head that
requires a lock. It is necessary since the lock that protects the list
head must be dropped in getdirtybuf() so that we don't have a lock order
reversal with the buf queues lock in bremfree().
- Adjust all callers of getdirtybuf() to match the new semantics.
- Add a comment in indir_trunc() that points at unlocked access to a buf.
This may also be one of the last instances of incore() in the tree.


119521 28-Aug-2003 jeff

- Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode
interlock.
- Don't use the B_LOCKED flag and QUEUE_LOCKED for background write
buffers. Check for the BKGRDINPROG flag before recycling or throwing
away a buffer. We do this instead because it is not safe for us to move
the original buffer to a new queue from the callback on the background
write buffer.
- Remove the B_LOCKED flag and the locked buffer queue. They are no longer
used.
- The vnode interlock is used around checks for BKGRDINPROG where it may
not be strictly necessary. If we hold the buf lock then a background
write will not be started without our knowledge, one may only be
completed while we're not looking. Rather than remove the code, document
two of the places where this extra locking is done. A pass should be
done to verify and minimize the locking later.


119088 18-Aug-2003 alc

The previous change necessitates the addition of a new #include. Otherwise,
there is a compilation warning.


119049 17-Aug-2003 phk

Don't use a VOP_*() function on our own vnodes, go directly to the
relevant internal function, in this case ufs_bmaparray().


118986 16-Aug-2003 alc

Revision 1.44 of ufs/ufs/inode.h has made it necessary to add two new
#includes to this file. Otherwise, it doesn't compile.


118969 15-Aug-2003 phk

Eliminate the i_devvp field from the incore UFS inodes, we can
get the same value from ip->i_ump->um_devvp.

This saves a pointer in the memory copies of inodes, which can
easily run into several hundred kilobytes.

The extra indirection is unmeasurable in benchmarks.

Approved by: mckusick


118607 07-Aug-2003 jhb

Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort. In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by: bde (kern_ktrace.c)


118411 04-Aug-2003 rwatson

Now that the central POSIX.1e ACL code implements functions to
generate the inode mode from a default ACL and creation mask,
implement ufs_sync_inode_from_acl() using acl_posix1e_newfilemode().

Since ACL_OVERRIDE_MASK/ACL_PRESERVE_MASK are defined, we no
longer need to explicitly pass in a "preserve_mask" field: this
is implicit in the use of POSIX.1e semantics.

Note: this change contains a semantic bugfix for new file creation:
we now intersect the ACL-generated mode and the cmode requested by
the user process. This means permissions on newly created file
objects will now be more conservative. In the future, we may want
to provide alternative semantics (similar to Solaris and Linux) in
which the ACL mask overrides the umask, permitting ACLs to broaden
the rights beyond the requested umask.

PR: 50148
Reported by: Ritz, Bruno <bruno_ritz@gmx.ch>
Obtained from: TrustedBSD Project
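
The intersection mentioned in the note can be illustrated as follows. This
is only a sketch: newfile_mode() is a hypothetical helper and does not
reproduce the interface of acl_posix1e_newfilemode().

```c
#include <sys/types.h>

/*
 * Hypothetical helper: the permission bits of a new file are the
 * intersection of the mode derived from the default ACL and the
 * creation mode (cmode) requested by the process.
 */
static mode_t
newfile_mode(mode_t acl_mode, mode_t cmode)
{
	return (acl_mode & cmode & 0777);
}
```

With an ACL-derived mode of 0750 and a requested cmode of 0644, the result
is 0640: a bit survives only if both sources grant it.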


118404 04-Aug-2003 rwatson

In ufs_chmod(), use privilege only when required in the following
cases:

- Setting sticky bit on non-directory
- Setting setgid on a file with a group that isn't in the effective
or extended groups of the authorizing credential

I.e., test the requirement first, then do the privilege test,
rather than doing the privilege test regardless of the need for
privilege.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
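
A rough sketch of the "requirement first, privilege second" ordering for
the sticky-bit case. The names has_privilege() and check_sticky(), the
counter, and the -1 return value are all illustrative assumptions, not the
code in ufs_chmod().

```c
#include <stdbool.h>

static int priv_checks;                 /* counts privilege lookups */

/* Stand-in for the kernel's privilege test; hypothetical. */
static bool
has_privilege(bool is_root)
{
	priv_checks++;
	return (is_root);
}

/*
 * Returns 0 if setting the sticky bit is allowed, -1 otherwise.
 * The requirement (sticky bit on a non-directory) is tested first,
 * so the privilege lookup happens only when privilege is needed.
 */
static int
check_sticky(bool is_dir, bool wants_sticky, bool is_root)
{
	if (wants_sticky && !is_dir && !has_privilege(is_root))
		return (-1);
	return (0);
}
```

Because of short-circuit evaluation, a chmod on a directory never reaches
has_privilege() at all.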


118131 28-Jul-2003 rwatson

Rename VOP_RMEXTATTR() to VOP_DELETEEXTATTR() for consistency with the
kernel ACL interfaces and system call names.

Break out UFS2 and FFS extattr delete and list vnode operations from
setextattr and getextattr to deleteextattr and listextattr, which
cleans up the implementations, and makes the results more readable,
and makes the APIs more clear.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


118094 27-Jul-2003 phk

Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.


118047 26-Jul-2003 phk

Add a "int fd" argument to VOP_OPEN() which in the future will
contain the filedescriptor number on opens from userland.

The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful, in particular to fdescfs and /dev/fd/*.

For now pass -1 all over the place.


117221 04-Jul-2003 phk

We just cached the inode pointer, no need to call VTOI() again.


116423 15-Jun-2003 alc

Lock the vm object when freeing pages.


116412 15-Jun-2003 phk

Add the same KASSERT to all VOP_STRATEGY and VOP_SPECSTRATEGY implementations
to check that the buffer points to the correct vnode.


116384 15-Jun-2003 rwatson

Re-implement kernel access control for quotactl() as found in the
UFS quota implementation. Push some quite broken access control
logic out of ufs_quotactl() into the individual command
implementations in ufs_quota.c; fix that logic. Pass in the thread
argument to any quotactl command that will need to perform access
control.

o quotaon() requires privilege (PRISON_ROOT).

o quotaoff() requires privilege (PRISON_ROOT).

o getquota() requires that:

If the type is USRQUOTA, either the effective uid match the
requested quota ID, that the unprivileged_get_quota flag be
set, or that the thread be privileged (PRISON_ROOT).

If the type is GRPQUOTA, require that either the thread be
a member of the group represented by the requested quota ID,
that the unprivileged_get_quota flag be set, or that the
thread be privileged (PRISON_ROOT).

o setquota() requires privilege (PRISON_ROOT).

o setuse() requires privilege (PRISON_ROOT).

o qsync() requires no special privilege (consistent with what
was present before, but probably not very useful).

Add a new sysctl, security.bsd.unprivileged_get_quota, which when
set to a non-zero value, will permit unprivileged users to query user
quotas with non-matching uids and gids. Set this to 0 by default
to be mostly consistent with the previous behavior (the same for
USRQUOTA, but not for GRPQUOTA).

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
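
The getquota() rules listed above can be sketched as a single predicate.
The structure cred_sketch, the enum values, and the function names are
hypothetical stand-ins; the real checks live in ufs_quota.c and use the
kernel's credential and privilege interfaces.

```c
#include <stdbool.h>
#include <sys/types.h>

enum quota_type { USRQUOTA_T, GRPQUOTA_T };  /* stand-ins for USRQUOTA/GRPQUOTA */

struct cred_sketch {
	uid_t euid;             /* effective uid */
	const gid_t *groups;    /* effective and extended groups */
	int ngroups;
	bool privileged;        /* result of the privilege test */
};

static bool
in_group(const struct cred_sketch *cr, gid_t gid)
{
	for (int i = 0; i < cr->ngroups; i++)
		if (cr->groups[i] == gid)
			return (true);
	return (false);
}

/* True if the caller may read the quota identified by (type, id). */
static bool
getquota_allowed(const struct cred_sketch *cr, enum quota_type type,
    unsigned long id, bool unprivileged_get_quota)
{
	if (unprivileged_get_quota || cr->privileged)
		return (true);
	if (type == USRQUOTA_T)
		return (cr->euid == (uid_t)id);
	return (in_group(cr, (gid_t)id));
}
```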


116271 12-Jun-2003 phk

Initialize struct vfsops C99-sparsely.

Submitted by: hmp
Reviewed by: phk
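
For readers unfamiliar with the term, "C99-sparse" initialization means
designated initializers: only the members that are implemented are named,
and the rest default to zero. The structure below is a cut-down,
hypothetical stand-in for struct vfsops.

```c
#include <stddef.h>

struct vfsops_sketch {
	int (*vfs_mount)(void);
	int (*vfs_unmount)(void);
	int (*vfs_sync)(void);
};

static int sketch_mount(void) { return (0); }
static int sketch_sync(void) { return (0); }

/* Only the implemented operations are named; vfs_unmount stays NULL. */
static struct vfsops_sketch sketch_vfsops = {
	.vfs_mount = sketch_mount,
	.vfs_sync = sketch_sync,
};
```

The initializer no longer depends on member order, so adding or reordering
members in the structure does not silently shift function pointers.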


116192 11-Jun-2003 obrien

Use __FBSDID().


115869 05-Jun-2003 rwatson

Implement ffs_listextattr() by breaking out that logic and special-cased
attribute name of "" from ffs_getextattr(). Invoking VOP_GETEXTATTR()
with an empty name is now no longer supported; user application
compatibility is provided by a system call level compatibility
wrapper. We make sure to explicitly reject attempts to set an EA
with the name "".

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


115865 05-Jun-2003 rwatson

Don't special-case handling of the empty string in the UFS1
extended attribute retrieval code: it's no longer special-cased,
and is caught by the normal UFS1 EA validity checks (and, in
fact, returns the same error, EINVAL).

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


115588 01-Jun-2003 rwatson

Return EOPNOTSUPP for attempted EA operations on VCHR vnodes in UFS2;
if we permit them to occur, the kernel panics due to our performing
EA operations using VOP_STRATEGY on the vnode. This went unnoticed
previously because there are very few users of device nodes on UFS2
due to the introduction of devfs. However, this can come up with
the Linux compat directories and its hard-coded dev nodes (which will
need to go away as we move away from hard-coded device numbers).
This can come up if you use EA-intensive features such as ACLs and
MAC.

The proper fix is pretty complicated, but this band-aid would be
an excellent MFC candidate for the release.


115526 31-May-2003 phk

Remove unused variable.

Found by: FlexeLint


115474 31-May-2003 phk

Remove unused local variables.

Found by: FlexeLint


115456 31-May-2003 phk

The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent
deadlocks with vnode backed md(4) devices because md now uses a
kthread to run the bio requests instead of doing it directly from
the bio down path.


115145 18-May-2003 alc

Lock the vm object when performing vm_object_page_clean().

Approved by: re (rwatson)


115040 15-May-2003 rwatson

Jeff added locking assertions that the VV_ flags on vnodes were modified
only while holding appropriate vnode locks. This patch slides the lock
release for ufs_extattr_enable() to continue to hold the active vnode lock
on a backing file until after the flag change; it also acquires a vnode
lock when disabling an attribute and hence clearing a flag on the backing
vnode. This permits VFS_DEBUG_LOCKS to run UFS1 extended attributes
without panicking, as well as preventing a potential race and vnode flag
problem.

Approved by: re (jhb)
Pointed out by: DEBUG_VFS_LOCKS


114599 03-May-2003 alc

Lock the vm_object on entry to vm_object_vndeallocate().


114396 01-May-2003 tjr

Do not attempt to free NULL dinodes (i_din1 or i_din2) in ffs_ifree().
These fields can be left as NULL if ffs_vget() allocates an inode but
fails before the dinode memory has been allocated. There are two cases
when this can occur: when we lose a race and another process has added
the inode to the hash, and when reading the inode off disk fails.

The bug was observed by Kris on one of the package-building machines.
See http://marc.theaimsgroup.com/?l=freebsd-current&m=105172731013411&w=2
In Kris's case, it was the bread() that failed because of a disk error.

The alternative to this patch is to ensure that ffs_vget() does not call
vput() when the inode hasn't been properly initialised.


114395 01-May-2003 tjr

Free i_din2 instead of i_din1 in ffs_ifree() on UFS2 filesystems.
This is purely a cosmetic change because these members are in a
union together.


114293 30-Apr-2003 markm

Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.


114216 29-Apr-2003 kan

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


113872 22-Apr-2003 jhb

Lock both the proc lock and sched_lock when calling sched_nice since
kg_nice is now protected by both. Being protected by both means that
other places in the kernel that want to read kg_nice only need one of the
two locks.


113376 12-Apr-2003 jeff

- Use the sched_nice() api instead of setting the nice value directly.

Tested by: Steve Kargl <sgk@troutmask.apl.washington.edu>


113175 06-Apr-2003 alc

Sufficient access checks are performed by vmapbuf() that calling useracc()
is pointless. Remove the call to useracc().

Don't reinitialize fields that are already initialized by getpbuf().

Reviewed by: tegge


112724 27-Mar-2003 tegge

Check return value from vmapbuf instead of the function address.
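
The class of bug fixed here is easy to reproduce in miniature. In the
sketch below, vmapbuf_sketch() is a stand-in whose "-1 on failure"
convention is an assumption made for illustration, and map_user_buffer()
is a hypothetical caller.

```c
#include <errno.h>

struct buf_sketch {
	int b_mapped;
};

/* Stand-in for vmapbuf(): returns -1 on failure, 0 on success. */
static int
vmapbuf_sketch(struct buf_sketch *bp, int fail)
{
	if (fail)
		return (-1);
	bp->b_mapped = 1;
	return (0);
}

static int
map_user_buffer(struct buf_sketch *bp, int fail)
{
	/*
	 * The buggy form, "if (vmapbuf_sketch < 0)", tests the function's
	 * address instead of calling it, so a failed mapping would go
	 * unnoticed. The call must actually be made and its result tested.
	 */
	if (vmapbuf_sketch(bp, fail) < 0)
		return (EFAULT);
	return (0);
}
```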


112718 27-Mar-2003 tegge

Eliminate a buffer sleep/wakeup race.


112694 26-Mar-2003 tegge

Add support for reading directly from file to userland buffer when the
O_DIRECT descriptor status flag is set and both offset and length are
multiples of the physical media sector size.
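
The eligibility condition can be written as a small predicate. This is a
sketch only; direct_read_eligible() and its parameters are hypothetical
and do not mirror the kernel's function.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a read may bypass the buffer cache: O_DIRECT is set and
 * both the offset and the length are multiples of the sector size.
 */
static bool
direct_read_eligible(bool o_direct, int64_t offset, int64_t length,
    int64_t secsize)
{
	if (!o_direct || secsize <= 0)
		return (false);
	return (offset % secsize == 0 && length % secsize == 0);
}
```

Requests that fail the test would simply take the ordinary buffered path.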


112451 20-Mar-2003 jhb

Use td->td_ucred instead of td->td_proc->p_ucred.


112450 20-Mar-2003 jhb

Minor fixes to ffs_fserr():
- Assume that curthread is not NULL. It never is in -current.
- Use td_ucred instead of p_ucred.


112367 18-Mar-2003 phk

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a nested include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


112181 13-Mar-2003 jeff

- Remove a race between fsync like functions and flushbufqueues() by
requiring locked bufs in vfs_bio_awrite(). Previously the buf could
have been written out by fsync before we acquired the buf lock if it
weren't for giant. The cluster_wbuild() handles this race properly but
the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a
parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find
any bufs without deps.


111972 07-Mar-2003 mckusick

Use the appropriate size when zeroing out the unused portion
of a snapshot's copy of a superblock. This patch fixes a panic
when taking a snapshot of a 4096/512 filesystem.

Reported by: Ian Freislich <ianf@za.uu.net>
Sponsored by: DARPA & NAI Labs.


111937 06-Mar-2003 alc

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


111856 04-Mar-2003 jeff

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
where we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviewed by: arch
Not objected to by: mckusick


111841 03-Mar-2003 njl

Finish cleanup of vprint() which was begun with changing v_tag to a string.
Remove extraneous uses of vop_null, instead deferring to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.


111748 02-Mar-2003 des

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


111510 25-Feb-2003 mckusick

Change the field used to test whether the superblock has been updated
from the filesystem size field to the filesystem maximum blocksize
field. The problem is that older versions of growfs updated only the
new size field and not the old size field. This resulted in the old
(smaller) size field being copied up to the new size field which
caused the filesystem to appear to fsck to be badly trashed.

This also adds a sanity check to ensure that the superblock is not
being updated when the filesystem is mounted read-only. Obviously
such an update should never happen.

Reported by: Nate Lawson <nate@root.org>
Sponsored by: DARPA & NAI Labs.


111463 25-Feb-2003 jeff

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


111423 24-Feb-2003 das

Expand the reference count on struct dquot to 32 bits.
This fixes a panic on large systems where a single user
may have more than 64K active or inactive vnodes.

PR: 48234
Reviewed by: mike (mentor)


111420 24-Feb-2003 mckusick

When removing the last item from a non-empty worklist, the worklist
tail pointer must be updated.

Reported by: Kris Kennaway <kris@obsecurity.org>
Sponsored by: DARPA & NAI Labs.
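
A minimal sketch of a singly linked list with a tail pointer, showing the
step this fix is about. The types and function names are hypothetical and
far simpler than the soft updates worklist.

```c
#include <stddef.h>

struct item {
	struct item *next;
};

struct worklist {
	struct item *head;
	struct item *tail;
};

static void
wl_append(struct worklist *wl, struct item *it)
{
	it->next = NULL;
	if (wl->tail == NULL)
		wl->head = it;
	else
		wl->tail->next = it;
	wl->tail = it;
}

/* Remove 'it'; 'prev' is its predecessor, or NULL if 'it' is the head. */
static void
wl_remove(struct worklist *wl, struct item *prev, struct item *it)
{
	if (prev == NULL)
		wl->head = it->next;
	else
		prev->next = it->next;
	if (wl->tail == it)
		wl->tail = prev;        /* keep the tail pointer valid */
	it->next = NULL;
}
```

Without the tail update, a later append would link the new item behind an
element that is no longer on the list.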


111240 22-Feb-2003 mckusick

This patch fixes a deadlock between the bufdaemon and a process taking
a snapshot. As part of taking a snapshot of a filesystem, the kernel
builds up a list of the filesystem metadata (such as the cylinder
group bitmaps) that are contained in the snapshot. When doing a
copy-on-write check, the list is first consulted. If the block being
written is found on the list, then the full snapshot lookup can be
avoided. Besides providing an important performance speedup this
check also avoids a potential deadlock between the code creating
the snapshot and the bufdaemon trying to cleanup snapshot related
buffers. This fix creates a temporary list containing the key
metadata blocks that can cause the deadlock. This temporary list
is used between the time that the snapshot is first enabled and the
time that the fully complete list is built.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.


111239 22-Feb-2003 mckusick

This patch fixes a bug that causes an active filesystem on which a
snapshot is being taken to panic with either "freeing free block" or
"freeing free inode". The problem arises when the snapshot code
is scanning the filesystem looking for inodes with a reference
count of zero (e.g., unlinked but still open) so that it can
expunge them from its view. If it encounters a reclaimed vnode
and has to restart its scan, then it will panic if it encounters
and tries to free an inode that it has already processed. The fix
is to check each candidate inode to see if it has already been
processed before trying to delete it from the snapshot image.

Sponsored by: DARPA & NAI Labs.


111238 22-Feb-2003 mckusick

This patch fixes a bug in the logical block calculation macros so
that they convert to 64-bit values before shifting rather than
afterwards. Once fixed, they can be used rather than inline expanded.

Sponsored by: DARPA & NAI Labs.
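
The difference between widening before and after the shift can be shown
with two small functions. The names are hypothetical and are not the
macros touched by this change; the narrow variant uses unsigned arithmetic
so that the truncation is well defined in this sketch.

```c
#include <stdint.h>

/* Widen first, then shift: the full 64-bit result is preserved. */
static int64_t
blk_to_size_wide(int32_t blk, int bshift)
{
	return ((int64_t)blk << bshift);
}

/* Shift in 32 bits, then widen: the high bits are already gone. */
static int64_t
blk_to_size_narrow(int32_t blk, int bshift)
{
	return ((int64_t)(uint32_t)((uint32_t)blk << bshift));
}
```

For block 0x100000 and a shift of 14, the wide form yields 2^34 bytes
while the narrow form yields 0.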


111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


110885 14-Feb-2003 mckusick

Replace use of random() with arc4random() to provide less guessable
values for the initial inode generation numbers in newfs and for
newly allocated inode generation numbers in the kernel.

Submitted by: Theo de Raadt <deraadt@cvs.openbsd.org>
Sponsored by: DARPA & NAI Labs.


110837 14-Feb-2003 mckusick

Correct lines incorrectly added to the copyright message.

Submitted by: Frank van der Linden <fvdl@wasabisystems.com>
Sponsored by: DARPA & NAI Labs.


110584 09-Feb-2003 jeff

- Cleanup unlocked accesses to buf flags by introducing a new b_vflag member
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h


110234 02-Feb-2003 alfred

Catch more uses of MIN().


109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


109153 13-Jan-2003 dillon

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


109123 12-Jan-2003 dillon

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


109053 10-Jan-2003 marcel

o Improve wording of the comment that accompanies fs_pad. The
padding is not specific to non-i386 architectures. It is
caused by non-i386 specific alignment requirements of
fs_swuid.
o Add a CTASSERT to catch a change in the size of struct fs
at compile-time rather than run-time.

Ok'd: gordon
Tested on: i386 ia64


109034 09-Jan-2003 gordon

Fix superblock alignment problems on non-i386 platforms. Also change fs_uuid
to fs_swuid, making it more descriptive.

Submitted by: marcel
Reviewed by: peter
Pointy hat to: gordon


108970 08-Jan-2003 gordon

Steal some space from fs_fsmnt to create fs_volname and fs_uuid. The volname
will be used to support volume names with the help of a GEOM module (to be
committed). uuid will be used to deal with conflicting volume names (which
doesn't work just yet).

Approved by: mckusick@


108892 07-Jan-2003 mckusick

This patch fixes a problem caused by applications that rapidly and
repeatedly truncate the same file. Each time the file is truncated,
a buffer is grabbed to store the indirect block numbers that need
to be freed. Those blocks cannot be freed until the inode claiming
them is written to disk. Thus, the number of buffers being held by
soft updates explodes and in extreme cases can run the kernel out
of buffers. The problem can be avoided by doing an fsync on the
file every debug.maxindirdep truncates (currently defaulted to 50).
The fsync causes the inode to be written so that the held buffers
can be freed. The check for excessive buffers is performed as part
of the existing hook for excessive dependencies (softdep_slowdown)
in the truncate code.

Reported by: David Schultz <dschultz@uclink.Berkeley.EDU>
Sponsored by: DARPA & NAI Labs.
MFC after: 3 weeks


108686 04-Jan-2003 phk

Temporarily introduce a new VOP_SPECSTRATEGY operation while I try
to sort out disk-io from file-io in the vm/buffer/filesystem space.

The intent is to sort VOP_STRATEGY calls into those which operate
on "real" vnodes and those which operate on VCHR vnodes. For
the latter kind, the call will be changed to VOP_SPECSTRATEGY,
possibly conditionally for those places where dual-use happens.

Add a default VOP_SPECSTRATEGY method which will call the normal
VOP_STRATEGY. First time it is called it will print debugging
information. This will only happen if a normal vnode is passed
to VOP_SPECSTRATEGY by mistake.

Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY
does on a VCHR vnode today.

Add a new VOP_STRATEGY method in specfs to catch instances where
the conversion to VOP_SPECSTRATEGY has not yet happened. Handle
the request just like we always did, but first time called print
debugging information.

Apart from up to two instances of console messages per boot, this amounts
to a glorified no-op commit.

If you get any of the messages on your console I would very much
like a copy of them mailed to phk@freebsd.org


108648 04-Jan-2003 phk

Since Jeffr made the std* functions the default in rev 1.63 of
kern/vfs_defaults.c it is wrong for the individual filesystems to use
the std* functions as that prevents override of the default.

Found by: src/tools/tools/vop_table


108589 03-Jan-2003 phk

Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since
all BUF_STRATEGY did in the first place was call VOP_STRATEGY.


108533 01-Jan-2003 schweikh

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


108524 01-Jan-2003 alfred

When compiling the kernel do not implicitly include filedesc.h from proc.h,
this was causing filedesc work to be very painful.
In order to make this work split out sigio definitions to their own header
(sigio.h) which is included from proc.h for the time being.


108316 27-Dec-2002 phk

Use three UMA zones for FFS/UFS inodes instead of malloc space.
Since inodes are currently 144 bytes, this will save 112 bytes per
inode. This can amount to up to 10MByte on large systems.
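
The 112-byte figure is consistent with an allocator that rounds each
request up to the next power of two, so that a 144-byte inode occupies a
256-byte bucket. That rounding behaviour is an assumption here, not
something the entry states; the helper names are hypothetical.

```c
#include <stddef.h>

/* Round n up to the next power of two. */
static size_t
round_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return (p);
}

/* Bytes wasted per object when it is stored in a power-of-two bucket. */
static size_t
bytes_saved(size_t objsize)
{
	return (round_pow2(objsize) - objsize);
}
```

A zone sized exactly for the object avoids that per-object slack.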


108315 27-Dec-2002 phk

Move the allocation of the inode contents into ffs_vfsops.c rather than
passing malloc types around.


108313 27-Dec-2002 phk

Make ffs_mountfs() static.

Remove the malloctype from the ufs mount structure, instead add a callback
to the storage method for freeing inodes: UFS_IFREE().

Add vfs_ifree() method function which frees an inode.

Unvariablelize the malloc type used for allocating inodes.


108050 18-Dec-2002 mckusick

Fix corruption introduced in previous delta.

Reported by: Aurelien Nephtali <aurelien.nephtali@wanadoo.fr>
Sponsored by: DARPA & NAI Labs.


108017 18-Dec-2002 mckusick

Keep comments consistent with the code. Minor optimization.

Sponsored by: DARPA & NAI Labs.


108010 18-Dec-2002 mckusick

Cosmetic cleanup of unsigned buglets.

Submitted by: Bruce Evans <bde@zeta.org.au>
Sponsored by: DARPA & NAI Labs.


107992 17-Dec-2002 phk

Remove unused lockcnt variable.

Approved by: mckusick


107915 15-Dec-2002 mckusick

Update to previous change (1.54) to use an appropriately wide inode field
so as to work correctly on 64-bit platforms.

Reported-by: Jake Burkholder <jake@locore.ca>
Sponsored by: DARPA & NAI Labs.
Approved by: Ian Dowse <iedowse@maths.tcd.ie>


107868 14-Dec-2002 iedowse

Undo the adjustment of the total memory used by dirhash in the case
where allocating the dirhash structure fails. Fix a few typos in
comments and update copyright.

MFC after: 1 week


107848 14-Dec-2002 mckusick

Only the most recent snapshot contains the complete list of blocks
that were copied in all of the earlier snapshots, thus its precomputed
list must be used in the copyonwrite test. Using incomplete lists may
lead to deadlock. Also do not include the blocks used for the indirect
pointers in the indirect pointers as this may lead to inconsistent
snapshots.

Sponsored by: DARPA & NAI Labs.
Approved by: re


107762 12-Dec-2002 trhodes

Remove the comment about dump(8) not working properly with snapshots.

Discussed with: mckusick
Approved by: re (rwatson)


107651 06-Dec-2002 mckusick

More tightly verify the preference returned for the new inode.

Submitted by: Kris Kennaway <kris@obsecurity.org>
Sponsored by: DARPA & NAI Labs.
Approved by: re


107558 03-Dec-2002 mckusick

Have to use bread() rather than UFS_BALLOC() when obtaining a
previously allocated block as the previous use of the block may
have fallen out of the cache. Failure to reread its contents causes
zeroed results to be written instead of the proper contents.
Conversely, when the block is going to be entirely filled in, it
is not necessary to reread the old contents.

Sponsored by: DARPA & NAI Labs.
Approved by: re


107415 30-Nov-2002 mckusick

Add a check to disable the previous patch so that future filesystems
that choose to place their superblocks in non-standard locations will
not get them smashed.

Sponsored by: DARPA & NAI Labs.


107414 30-Nov-2002 mckusick

Remove a race condition / deadlock from snapshots. When
converting from individual vnode locks to the snapshot
lock, be sure to pass any waiting processes along to the
new lock as well. This transfer is done by a new function
in the lock manager, transferlockers(from_lock, to_lock);
Thanks to Lamont Granquist <lamont@scriptkiddie.org> for
his help in pounding on snapshots beyond all reason and
finding this deadlock.

Sponsored by: DARPA & NAI Labs.


107406 30-Nov-2002 mckusick

Fix two deadlocks in snapshots:

1) Release the snapshot file lock while suspending the system. Otherwise
a process trying to read the lock may block on its containing directory
preventing the suspension from completing. Thanks to Sean Kelly
<smkelly@zombie.org> for finding this deadlock.

2) Replace some bdwrite's with bawrite's so as not to fill all the
buffers with dirty data. The buffers could not be cleaned as the
snapshot vnode was locked hence the system could deadlock when
making snapshots of really massive filesystems. Thanks to
Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp> for figuring
this out.

Sponsored by: DARPA & NAI Labs.


107393 29-Nov-2002 mckusick

Check to make sure that the fs_sblockloc field was properly updated
before using it to write the superblock. This is to guard against
accidentally trashing the disklabel if the superblock format missed
being upgraded by the new kernel.

Reported by: Sam Leffler <sam@errno.com>
Sponsored by: DARPA & NAI Labs.
Approved by: Murray Stokely <murray@FreeBSD.org>


107294 27-Nov-2002 mckusick

Create a new 32-bit fs_flags word in the superblock. Add code to move
the old 8-bit fs_old_flags to the new location the first time that the
filesystem is mounted by a new kernel. One of the unused flags in
fs_old_flags is used to indicate that the flags have been moved.
Leave the fs_old_flags word intact so that it will work properly if
used on an old kernel.

Change the fs_sblockloc superblock location field to be in units
of bytes instead of in units of filesystem fragments. The old units
did not work properly when the fragment size exceeded the superblock
size (8192). Update old fs_sblockloc values at the same time that
the flags are moved.

Suggested by: BOUWSMA Barry <freebsd-misuser@netscum.dyndns.dk>
Sponsored by: DARPA & NAI Labs.
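
The one-time flag migration can be sketched as follows. The structure, the
marker value 0x80, and the function name are illustrative assumptions; the
marker bit itself is kept out of the new word here, which the entry does
not specify.

```c
#include <stdint.h>

#define OLD_FLAGS_MOVED 0x80    /* hypothetical "already migrated" marker */

struct sb_flags_sketch {
	uint8_t  old_flags;     /* stand-in for fs_old_flags */
	uint32_t flags;         /* stand-in for the new 32-bit fs_flags */
};

/* Copy the old flags to the new word exactly once. */
static void
migrate_flags(struct sb_flags_sketch *sb)
{
	if (sb->old_flags & OLD_FLAGS_MOVED)
		return;                         /* done on an earlier mount */
	sb->flags = sb->old_flags;
	sb->old_flags |= OLD_FLAGS_MOVED;       /* old bits stay intact */
}
```

Because the old bits are left in place, an older kernel that reads only
the 8-bit field still sees the flags it understands.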


107096 20-Nov-2002 mckusick

The target for the maximum number of dependencies has been cut
in half because of reports that under heavy load the kernel could
exhaust its memory pool. The limit is now (desiredvnodes * 4)
rather than (desiredvnodes * 8), so it will still scale with
larger systems, just not as quickly.

Sponsored by: DARPA & NAI Labs.


107095 20-Nov-2002 mckusick

If an error occurs while writing a buffer, then the data will
not have hit the disk and the dependencies cannot be unrolled.
In this case, the system will mark the buffer as dirty again so
that the write can be retried in the future. When the write
succeeds or the system gives up on the buffer and marks it as
invalid (B_INVAL), the dependencies will be cleared.

Sponsored by: DARPA & NAI Labs.


106965 15-Nov-2002 peter

Do not assume that time_t is an int.

Approved by: re (jhb)


106673 08-Nov-2002 jhb

Print daddr_t's with %j and intmax_t.
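
The idiom is to cast to intmax_t and print with the j length modifier, so
the format string stays correct whatever the width of the block type. In
this sketch daddr_sketch_t and format_blkno() are hypothetical stand-ins.

```c
#include <stdint.h>
#include <stdio.h>

typedef int64_t daddr_sketch_t;         /* stand-in for daddr_t */

/* Format a block number portably, regardless of the typedef's width. */
static int
format_blkno(char *buf, size_t len, daddr_sketch_t blkno)
{
	return (snprintf(buf, len, "%jd", (intmax_t)blkno));
}
```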


106394 04-Nov-2002 rwatson

Update licenses and wording: NAI has authorized the removal of clause three
of their BSD-style license; also, carry out the NAI Labs -> Network
Associates Laboratories renaming in these files.


106058 27-Oct-2002 wollman

Implement the new 1003.1-2001 pathconf() keys, including the Advisory
Information option. Other filesystem implementations should do something
similar.

With advice from: mckusick, phk


105988 26-Oct-2002 rwatson

Slightly change the semantics of vnode labels for MAC: rather than
"refreshing" the label on the vnode before use, just get the label
right from inception. For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system. With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance. This
also corrects semantics for shared vnode locks, which were not
previously present in the system. This changes the cache
coherency properties WRT out-of-band access to label data, but in
an acceptable form. With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception. We'll introduce a work around for this shortly.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105902 25-Oct-2002 mckusick

Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.


105823 23-Oct-2002 mckusick

We must be careful to avoid recursive copy-on-write faults when
trying to clean up during disk-full scenarios.

Sponsored by: DARPA & NAI Labs.


105763 23-Oct-2002 mckusick

Misplaced FREE_LOCK causes a panic when hit while taking a snapshot.

Sponsored by: DARPA & NAI Labs.


105670 22-Oct-2002 mckusick

This update further fine tunes the locking of snapshot vnodes in
the ffs_copyonwrite routine to avoid a deadlock between the syncer
daemon trying to sync out a snapshot vnode and the bufdaemon
trying to write out a buffer containing the snapshot inode.
With any luck this will be the last snapshot race condition.

Sponsored by: DARPA & NAI Labs.


105669 22-Oct-2002 mckusick

This update is a performance improvement when allocating blocks on
a full filesystem. Previously, if the allocation failed, we had to
fsync the file before rolling back any partial allocation of indirect
blocks. Most block allocation requests only need to allocate a single
data block and if that allocation fails, there is nothing to unroll.
So, before doing the fsync, we check to see if any rollback will
really be necessary. If none is necessary, then we simply return.
This update eliminates the flurry of disk activity that got triggered
whenever a filesystem would run out of space.

Sponsored by: DARPA & NAI Labs.


105667 22-Oct-2002 mckusick

This checkin reimplements the io-request priority hack in a way
that works in the new threaded kernel. It was commented out of
the disksort routine earlier this year for the reasons given in
kern/subr_disklabel.c (which is where this code used to reside
before it moved to kern/subr_disk.c):

----------------------------
revision 1.65
date: 2002/04/22 06:53:20; author: phk; state: Exp; lines: +5 -0
Comment out Kirks io-request priority hack until we can do this in a
civilized way which doesn't cause grief.

The problem is that it is not generally safe to cast a "struct bio
*" to a "struct buf *". Things like ccd, vinum, ata-raid and GEOM
constructs bio's which are not entrails of a struct buf.

Also, curthread may or may not have anything to do with the I/O request
at hand.

The correct solution can either be to tag struct bio's with a
priority derived from the requesting threads nice and have disksort
act on this field, this wouldn't address the "silly-seek syndrome"
where two equal processes bang the diskheads from one edge to the
other of the disk repeatedly.

Alternatively, and probably better: a sleep should be introduced
either at the time the I/O is requested or at the time it is completed
where we can be sure to sleep in the right thread.

The sleep also needs to be in constant timeunits, 1/hz can be practically
any sub-second size, at high HZ the current code practically doesn't
do anything.
----------------------------

As suggested in this comment, it is no longer located in the disk sort
routine, but rather now resides in spec_strategy where the disk operations
are being queued by the thread that is associated with the process that
is really requesting the I/O. At that point, the disk queues are not
visible, so the I/O for positively niced processes is always slowed
down whether or not there is other activity on the disk.

On the issue of scaling HZ, I believe that the current scheme is
better than using a fixed quantum of time. As machines and I/O
subsystems get faster, the resolution on the clock also rises.
So, ten years from now we will be slowing things down for shorter
periods of time, but the proportional effect on the system will
be about the same as it is today. So, I view this as a feature
rather than a drawback. Hence this patch sticks with using HZ.

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@critter.freebsd.dk>


105572 20-Oct-2002 rwatson

Rename _POSIX_FOO_PRESENT and friends from POSIX.1e to _PC_FOO_PRESENT
and related friends. This would have been corrected had POSIX.1e
progressed to a standard.

Pointed out by: wollman


105571 20-Oct-2002 rwatson

Implement _POSIX_ACL_PATH_MAX, which returns the maximum number of ACL
entries for a file system node using pathconf().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105567 20-Oct-2002 rwatson

Teach UFS to respond to pathconf() tests for _POSIX_ACL_EXTENDED and
_POSIX_MAC_PRESENT based on available mount flags, if the services are
available.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105456 19-Oct-2002 rwatson

Clarify that the UFS1 extended attribute configuration steps do not apply
to UFS2 file systems.

Submitted by: jedgar
Obtained from: TrustedBSD Project


105422 18-Oct-2002 dillon

Fix a file-rewrite performance case for UFS[2]. When rewriting portions
of a file in chunks that are less than the filesystem block size, if the
data is not already cached the system will perform a read-before-write.
The problem is that it does this on a block-by-block basis, breaking up the
I/Os and making clustering impossible for the writes. Programs such
as INN using cyclic file buffers suffer greatly. This problem is only going
to get worse as we use larger and larger filesystem block sizes.

The solution is to extend the sequential heuristic so UFS[2] can perform
a far larger read and readahead when dealing with this case.

(note: maximum disk write bandwidth is 27MB/sec thru filesystem)
(note: filesystem blocksize in test is 8K (1K frag))
dd if=/dev/zero of=test.dat bs=1k count=2m conv=notrunc

Before: (note half of these are reads)
      tty             da0              da1             acd0            cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0   76 14.21 598  8.30   0.00   0  0.00   0.00   0  0.00   0  0  7  1 92
   0   76 14.09 813 11.19   0.00   0  0.00   0.00   0  0.00   0  0  9  5 86
   0   76 14.28 821 11.45   0.00   0  0.00   0.00   0  0.00   0  0  8  1 91

After: (note half of these are reads)
      tty             da0              da1             acd0            cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0   76 63.62 434 26.99   0.00   0  0.00   0.00   0  0.00   0  0 18  1 80
   0   76 63.58 424 26.30   0.00   0  0.00   0.00   0  0.00   0  0 17  2 82
   0   76 63.82 438 27.32   0.00   0  0.00   0.00   0  0.00   1  0 19  2 79

Reviewed by: mckusick
Approved by: re
X-MFC after: immediately (was heavily tested in -stable for 4 months)


105417 18-Oct-2002 rwatson

Update extended attribute readme file to note that no special configuration
is required to use EAs with UFS2, and that UFS2 is recommended for EA use
for a variety of reasons.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105416 18-Oct-2002 rwatson

Update instructions for ACLs given recent tunefs, mount changes. Also
note that UFS2 doesn't require explicit extended attribute configuration,
and is recommended for this and other reasons if you plan to use ACLs.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105415 18-Oct-2002 rwatson

Use 'size_t' instead of 'int' for the result of sizeof().


105368 18-Oct-2002 mckusick

With the revised single-lock method used in snapshots, the
BA_NOWAIT flag is no longer needed.

Sponsored by: DARPA & NAI Labs.


105191 16-Oct-2002 mckusick

Change locking so that all snapshots on a particular filesystem share
a common lock. This change avoids a deadlock between snapshots when
separate requests cause them to deadlock checking each other for a
need to copy blocks that are close enough together that they fall
into the same indirect block. Although I had anticipated a slowdown
from contention for the single lock, my filesystem benchmarks show
no measurable change in throughput on a uniprocessor system with
three active snapshots. I conjecture that this result is because
every copy-on-write fault must check all the active snapshots, so
the process was inherently serial already. This change removes the
last of the deadlocks of which I am aware in snapshots.

Sponsored by: DARPA & NAI Labs.


105179 15-Oct-2002 rwatson

Push most UFS ACL behavior behind a check for MNT_ACLS, permitting ACLs
to be administratively disabled as needed on UFS/UFS2 file systems. This
also has the effect of preventing the slightly more expensive ACL code
from running on non-ACL file systems, avoiding storage allocation for
ACLs that may be read from disk. MNT_ACLS may be set at mount-time
using mount -o acls, or implicitly by setting the FS_ACLS flag using
tunefs. On UFS1, you may also have to configure ACL store.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


105169 15-Oct-2002 rwatson

If the FS_MULTILABEL flag is set in a UFS or UFS2 superblock,
automatically set MNT_MULTILABEL in the mount flags.

If FS_ACLS is set in a UFS or UFS2 superblock, automatically
set MNT_ACLS in the mount flags.

If either of these flags is set, but the appropriate kernel option
to support the features associated with the flag isn't available,
then print a warning at mount-time.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
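
The flag propagation described in this entry can be sketched roughly as
follows. This is a minimal illustration and not the committed code: the
helper name apply_superblock_flags() is hypothetical, and the flag values
are placeholders chosen so the example compiles on its own; the real
definitions live in the UFS and mount headers. The mount-time warning for
a missing kernel option is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for illustration only. */
#define FS_ACLS		0x10u
#define FS_MULTILABEL	0x20u
#define MNT_MULTILABEL	0x04000000u
#define MNT_ACLS	0x08000000u

/*
 * Hypothetical helper: copy the administrative superblock flags into
 * the mount flags, as the commit message describes.
 */
static void
apply_superblock_flags(uint32_t fs_flags, uint64_t *mnt_flags)
{
	assert(mnt_flags != NULL);
	if (fs_flags & FS_MULTILABEL)
		*mnt_flags |= MNT_MULTILABEL;
	if (fs_flags & FS_ACLS)
		*mnt_flags |= MNT_ACLS;
}
```

Flags already present in the mount flags are preserved; the helper only
ever adds bits.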


105136 14-Oct-2002 mckusick

When reading or writing the extended attributes of a special device
or fifo in UFS2, the normal ufs_strategy routine needs to be used
rather than the spec_strategy or fifo_strategy routine. Thus the
ffsext_strategy routine is interposed in the ffs_vnops vectors for
special devices and fifo's to pick off this special case. Otherwise
it simply falls through to the usual spec_strategy or fifo_strategy
routine.

Submitted by: Robert Watson <rwatson@FreeBSD.org>
Sponsored by: DARPA & NAI Labs.


105123 14-Oct-2002 rwatson

Fix two memory leaks in error conditions involving the UFS ACL code:
if failures occur, make sure that we release both the default ACL
and access ACL storage during new object creation.

Spotted by: phk and his pet flexelint
Sponsored by: DARPA, Network Associates Laboratories


105112 14-Oct-2002 rwatson

Define two new superblock file system flags:

FS_ACLS Administrative enable/disable of extended ACL support
FS_MULTILABEL Administrative flag to indicate to the MAC Framework
that objects in the file system are individually
labeled using extended attributes.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Reviewed by: (in principle) mckusick, phk


105077 14-Oct-2002 mckusick

Regularize the vop_stdlock'ing protocol across all the filesystems
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).

In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.

Sponsored by: DARPA & NAI Labs.
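
The v_vnlock indirection described in this entry can be modelled with a
few lines of userland C. This is a sketch under simplifying assumptions:
the example_* structures and functions are hypothetical stand-ins, and
the "lock" is just an integer rather than a real lockmgr lock.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct example_lock {
	int held;
};

struct example_vnode {
	struct example_lock v_lock;	/* default per-vnode lock */
	struct example_lock *v_vnlock;	/* lock used by the std lock ops */
};

/* Default setup: point v_vnlock at the vnode's own v_lock. */
static void
example_vnode_init(struct example_vnode *vp)
{
	vp->v_lock.held = 0;
	vp->v_vnlock = &vp->v_lock;
}

/* The std lock and unlock operations always go through v_vnlock. */
static void
example_stdlock(struct example_vnode *vp)
{
	assert(vp->v_vnlock != NULL);
	vp->v_vnlock->held = 1;
}

static void
example_stdunlock(struct example_vnode *vp)
{
	assert(vp->v_vnlock != NULL);
	vp->v_vnlock->held = 0;
}

/* A stacking filesystem may substitute another lock, e.g. the lower one. */
static void
example_share_lock(struct example_vnode *upper, struct example_vnode *lower)
{
	upper->v_vnlock = lower->v_vnlock;
}

/* On reclaim, restore the default before the vnode is reused. */
static void
example_reclaim(struct example_vnode *vp)
{
	vp->v_vnlock = &vp->v_lock;
}
```

With this arrangement a filesystem that is happy with the default never
touches v_vnlock, while one that manages its own lock changes a single
pointer and resets it in its reclaim routine.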


104908 11-Oct-2002 mike

Change iov_base's type from `char *' to the standard `void *'. All
uses of iov_base which assume its type is `char *' (in order to do
pointer arithmetic) have been updated to cast iov_base to `char *'.
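
As an illustration of the cast this entry refers to, the following sketch
advances an iovec after a partial transfer. The helper name
example_iov_advance() is hypothetical and is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Advance an iovec by n bytes.  iov_base is `void *', so it is cast to
 * `char *' before the addition; arithmetic on `void *' is not standard C.
 */
static void
example_iov_advance(struct iovec *iov, size_t n)
{
	assert(n <= iov->iov_len);
	iov->iov_base = (char *)iov->iov_base + n;
	iov->iov_len -= n;
}
```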


104716 09-Oct-2002 mux

Fix the build on 64-bit platforms.


104702 09-Oct-2002 mckusick

When creating a snapshot, create a list of initially allocated blocks.
Whenever doing a copy-on-write check, first look in the list of
initially allocated blocks to see if it is there. If so, no further
check is needed. If not, fall through and do the full check. This
change eliminates one of two known deadlocks caused by snapshots.
Handling the second deadlock will be the subject of another check-in.
This change also reduces the cost of the copy-on-write check by
speeding up the verification of frequently checked blocks.

Sponsored by: DARPA & NAI Labs.
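
The fast path described in this entry amounts to a membership test on the
list built at snapshot creation time. The sketch below assumes, purely
for illustration, that the list is kept as a sorted array and searched
with a binary search; the entry itself does not say how the list is
stored, and example_in_prealloc_list() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if blkno is in the sorted list of initially allocated
 * blocks, 0 otherwise.  On 0 the caller would fall through to the
 * full copy-on-write check.
 */
static int
example_in_prealloc_list(const int64_t *list, size_t n, int64_t blkno)
{
	size_t lo = 0, hi = n;

	assert(list != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (list[mid] == blkno)
			return (1);
		if (list[mid] < blkno)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (0);
}
```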


104698 09-Oct-2002 mckusick

When creating a snapshot, create a list of initially allocated blocks.
Whenever doing a copy-on-write check, first look in the list of
initially allocated blocks to see if it is there. If so, no further
check is needed. If not, fall through and do the full check. This
change eliminates one of two known deadlocks caused by snapshots.
Handling the second deadlock will be the subject of another check-in.
This change also reduces the cost of the copy-on-write check by
speeding up the verification of frequently checked blocks.

Sponsored by: DARPA & NAI Labs.


104697 09-Oct-2002 mckusick

The appropriate units for disk block addresses are always DEV_BSIZE,
even when the underlying device has a larger sector size. Therefore,
the filesystem code should not (and with this patch does not) try to
use the underlying sector size when doing disk block address calculations.

This patch fixes problems in -current when using the swap-based
memory-disk device (mdconfig -a -t swap ...). This bugfix is not
relevant to -stable as -stable does not have the memory-disk device.

Sponsored by: DARPA & NAI Labs.
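
A small sketch of the rule stated in this entry: disk block addresses are
counted in DEV_BSIZE units, so the conversion from a filesystem fragment
number depends only on the fragment size and never on the device's
sector size. The helper name is hypothetical, and DEV_BSIZE is defined
locally as 512 for the example.

```c
#include <assert.h>
#include <stdint.h>

#define DEV_BSIZE	512	/* unit of disk block addresses */

/* Convert a fragment number to a disk block address in DEV_BSIZE units. */
static int64_t
example_frag_to_diskblock(int64_t fragno, int32_t fsize)
{
	assert(fsize >= DEV_BSIZE && fsize % DEV_BSIZE == 0);
	return (fragno * (fsize / DEV_BSIZE));
}
```

With 1K fragments, for instance, fragment 10 starts at disk block 20
whether the underlying device uses 512-byte or larger sectors.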


104688 08-Oct-2002 jeff

- Remove LK_INTERLOCK from the vn_lock() in ffs_snapshot().

Pointy hat to: me
Found by: green


104364 02-Oct-2002 phk

Mark two places where an unsigned number is checked "if (foo < 0)" with
an XXX comment.

Somebody[TM] should look at this in some detail.

Spotted by: FlexeLint


104346 02-Oct-2002 dd

size_t is not a struct (fix mislabelling in a comment).


104302 01-Oct-2002 phk

Fix some harmless mis-indents.

Spotted by: FlexeLint


104104 28-Sep-2002 jmallett

When spamming me with a printf(9), under DIAGNOSTIC, at least be nice enough
to include a newline.

MFC after: 4 days
Sponsored by: Bright Path Solutions


104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


104052 27-Sep-2002 phk

Make it a tad easier to deal with struct inode in userland programs which
fondle /dev/kmem by using "struct cdev *" instead of "dev_t".

Requested by: jake


104051 27-Sep-2002 phk

Use our mount-credential if we get a NOCRED when we try to write out EA
space back to disk.

This is wrong in many ways, but not as wrong as a panic.

Panicked on: rwatson & jmallett
Sponsored by: DARPA & NAI Labs.


103946 25-Sep-2002 jeff

- Convert locks to use standard macros.
- Lock access to the buflists.
- Document broken locking.
- Use vrefcnt().


103945 25-Sep-2002 jeff

- Document broken locking.
- Use vrefcnt().


103944 25-Sep-2002 jeff

- Lock accesses to v_usecount.
- Convert interlock locks to use standard macros.


103943 25-Sep-2002 jeff

- Don't use the interlock to protect v_writecount.


103690 20-Sep-2002 phk

We don't need to #include <sys/disklabel.h>.
We don't need to #include <sys/disklabel.h> a second time either.

Sponsored by: DARPA & NAI Labs.


103636 19-Sep-2002 truckman

VOP_FSYNC() requires that its vnode argument be locked, which nfs_link()
wasn't doing. Rather than just lock and unlock the vnode around the call
to VOP_FSYNC(), implement rwatson's suggestion to lock the file vnode
in kern_link() before calling VOP_LINK(), since the other filesystems
also locked the file vnode right away in their link methods. Remove the
locking and unlocking from the leaf filesystem link methods.

Reviewed by: rwatson, bde (except for the unionfs_link() changes)


103594 19-Sep-2002 obrien

intmax_t is printed with %jd, not %lld.
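
For reference, a minimal sketch of the portable idiom this fix relies on:
cast the value to intmax_t and format it with the `j' length modifier.
The helper name example_format_offset() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit value portably, independent of the width of long long. */
static int
example_format_offset(char *dst, size_t len, int64_t off)
{
	assert(dst != NULL && len > 0);
	return (snprintf(dst, len, "%jd", (intmax_t)off));
}
```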


103559 18-Sep-2002 njl

Remove any VOP_PRINT that redundantly prints the tag.
Move lockmgr_printinfo() into vprint() for everyone's benefit.

Suggested by: bde


103314 14-Sep-2002 njl

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


103180 10-Sep-2002 bde

vfs_syscalls.c:
Changed rename(2) to follow the letter of the POSIX spec. POSIX
requires rename() to have no effect if its args "resolve to the same
existing file". I think "file" can only reasonably be read as referring
to the inode, although the rationale and "resolve" seem to say that
sameness is at the level of (resolved) directory entries.

ext2fs_vnops.c, ufs_vnops.c:
Replaced code that gave the historical BSD behaviour of removing one
link name by checks that this code is now unreachable. This fixes
some races. All vnodes needed to be unlocked for the removal, and
locking at another level using something like IN_RENAME was not even
attempted, so it was possible for rename(x, y) to return with both x
and y removed even without any unlink(2) syscalls (one process can
remove x using rename(x, y) and another process can remove y using
rename(y, x)).

Prodded by: alfred
MFC after: 8 weeks
PR: 42617


102991 05-Sep-2002 phk

Implement the VOP_OPENEXTATTR() and VOP_CLOSEEXTATTR() methods.

Use extattr_check_cred() to check access to EAs.

This is still a WIP.

Sponsored by: DARPA & NAI Labs.


102988 05-Sep-2002 phk

Use canonical extattr_check_cred() instead of private implementation of the
same policy.

Sponsored by: DARPA & NAI Labs.


102985 05-Sep-2002 phk

Fix credentials check: do not leak ENOATTR until we know if they're
supposed to know.

Sponsored by: DARPA & NAI Labs.


102957 05-Sep-2002 bde

Include <sys/malloc.h> instead of depending on namespace pollution 2
layers deep in <sys/proc.h> or <sys/vnode.h>.

Include <sys/vmmeter.h> instead of depending on namespace pollution in
<sys/pcpu.h>.

Sorted includes as much as possible.


102774 01-Sep-2002 rwatson

Since we have vp and td cached in local variables, use those instead
of dereferencing the VOP arguments again when calling the UFS code.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


102608 30-Aug-2002 phk

Correctly handle setting, getting and deleting EA's with zero length content.

Sponsored by: DARPA & NAI Labs.


102412 25-Aug-2002 charnier

Replace various spellings with FALLTHROUGH, which is lint()able


102382 25-Aug-2002 alc

o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since
pmap_zero_page() and pmap_zero_page_area() were modified to accept
a struct vm_page * instead of a physical address, vm_page_zero_fill()
and vm_page_zero_fill_area() have served no purpose.


102175 20-Aug-2002 phk

Implement list of EA return functionality.
Correctly delete EA's when the content length is set to zero.

Sponsored by: DARPA & NAI Labs.


102090 19-Aug-2002 phk

First snapshot of UFS2 EA support.

Sponsored by: DARPA & NAI Labs.


101941 15-Aug-2002 rwatson

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to clarify which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes are largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
- pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
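
The credential selection that this entry describes for vn_rdwr() can be
sketched as below. This is only an illustration of the stated rule: the
struct and function names are hypothetical stand-ins, and NOCRED is
defined locally as a null pointer for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel credential structure. */
struct example_cred {
	int uid;
};

#define NOCRED	((struct example_cred *)0)

/*
 * Pick the credential used to authorize the VOP: the file credential
 * if one was supplied, otherwise the active credential.  MAC
 * authorization (not shown) would always use active_cred.
 */
static struct example_cred *
example_select_vop_cred(struct example_cred *active_cred,
    struct example_cred *file_cred)
{
	assert(active_cred != NOCRED);
	return (file_cred != NOCRED ? file_cred : active_cred);
}
```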


101789 13-Aug-2002 phk

Expand the arguments to ffs_ext{read,write}() to their component
parts rather than use vop_{read,write}_args. Access to these
functions will ultimately not be available through the
"vop_{read,write}+IO_EXT" API but this functionality is retained
for debugging purposes for now.

Sponsored by: DARPA & NAI Labs.


101780 13-Aug-2002 phk

Unravel the UFS_EXTATTR incest between FFS and UFS: UFS_EXTATTR is an
UFS only thing, and FFS should in principle not know if it is enabled
or not.

This commit cleans ffs_vnops.c for such knowledge, but not ffs_vfsops.c

Sponsored by: DARPA and NAI Labs.


101777 13-Aug-2002 phk

Introduce typedefs for the member functions of struct vfsops and employ
these in the main filesystems. This does not change the resulting code
but makes the source a little bit more grepable.

Sponsored by: DARPA and NAI Labs.


101744 12-Aug-2002 rwatson

Pass IO_NOMACCHECK to vn_rdwr() in the following checks to prevent
enforcement of MAC policy on the read or write operations:

- In ext2fs, don't enforce MAC on loop-back reads and writes supporting
directory read operations in lookup(), directory modifications in
rename(), directory write operations in mkdir(), symlink write
operations in symlink().

- In the NFS client locking code, perform vn_rdwr() on the NFS locking
socket without enforcing MAC, since the write is done on behalf of
the kernel NFS implementation rather than the user process.

- In UFS, don't enforce MAC on loop-back reads and writes supporting
directory read operations in lookup(), and symlink write operations
in symlink().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101720 12-Aug-2002 phk

Stop pretending that the FFS file ufs_readwrite.c is a UFS file.

Instead of #including it, pull it into ffs_vnops.c and name things
correctly.

Sponsored by: DARPA & NAI Labs.


101717 12-Aug-2002 phk

Fix a comment.


101398 05-Aug-2002 iedowse

Don't call softdep_slowdown() if soft updates are not active on the
filesystem. This causes a panic for kernels compiled without
softupdates.

Reported by: luigi


101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


101073 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument UFS to support per-inode MAC labels. In particular,
invoke MAC framework entry points for generically supporting the
backing of MAC labels into extended attributes. This ends up
introducing new vnode operation vector entries that point at the MAC
framework entry points, as well as some explicit entry point
invocations for file and directory creation events so that the
MAC framework can push labels to disk before the directory names
become persistent (this will work better once EAs in UFS2 are
hooked into soft updates). The generic EA MAC entry points
support executing with the file system in either single label
or multilabel operation, and will fall back to the mount label
if multilabel is not specified at mount-time.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


101018 31-Jul-2002 phk

I forgot this bit of ugliness in the fsck_ffs cleanup.


100926 30-Jul-2002 phk

Fix braino in last commit.


100925 30-Jul-2002 phk

Move ffs_isfreeblock() to ffs_alloc.c and make it static.

Sponsored by: DARPA & NAI Labs.


100807 28-Jul-2002 alc

Lock page queue accesses by vm_page_free().


100393 20-Jul-2002 benno

Add a missing argument to the stub for softdep_setup_freeblocks.

Forgotten by: mckusick


100382 20-Jul-2002 peter

Fix a warning:
ffs_softdep.c:1630: warning: int format, different type arg (arg 2)


100344 19-Jul-2002 mckusick

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags to the high sixteen bits. This works because the high sixteen
bits of the IO_ word are reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, let's call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
attribute storage.

Sponsored by: DARPA & NAI Labs.
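
The IO_NORMAL/IO_EXT rules for VOP_READ and VOP_WRITE stated in this
entry can be summarized in a short sketch. The function name is
hypothetical, the flag values are placeholders for illustration, and
returning -1 for the invalid combination is an assumption of this
example rather than documented kernel behaviour.

```c
#include <assert.h>

/* Placeholder values for illustration only. */
#define IO_EXT		0x0400
#define IO_NORMAL	0x0800

/*
 * Normalize the ioflags word for a read or write: reject requests that
 * set both flags, and default to IO_NORMAL when neither is set.
 */
static int
example_normalize_rw_ioflags(int ioflags)
{
	assert(ioflags >= 0);
	if ((ioflags & (IO_NORMAL | IO_EXT)) == (IO_NORMAL | IO_EXT))
		return (-1);	/* mutually exclusive for read/write */
	if ((ioflags & (IO_NORMAL | IO_EXT)) == 0)
		ioflags |= IO_NORMAL;	/* backward-compatible default */
	return (ioflags);
}
```

For VOP_TRUNCATE the first check would not apply, since the entry allows
both flags to be given together there.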


100207 17-Jul-2002 mckusick

Change utimes to set the file creation time (for filesystems that
support creation times such as UFS2) to the value of the
modification time if the value of the modification time is older
than the current creation time. See utimes(2) for further details.

Sponsored by: DARPA & NAI Labs.
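
The rule in this entry, that the creation time is pulled back whenever
the modification time is set to something older, can be sketched as
follows. The structures and function names are hypothetical stand-ins
for the inode time fields, not the kernel's types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct example_time {
	int64_t sec;
	int32_t nsec;
};

struct example_inode_times {
	struct example_time mtime;
	struct example_time birthtime;
};

/* Return nonzero if a is strictly earlier than b. */
static int
example_time_before(const struct example_time *a,
    const struct example_time *b)
{
	if (a->sec != b->sec)
		return (a->sec < b->sec);
	return (a->nsec < b->nsec);
}

/* Set mtime; if it predates the creation time, move that back too. */
static void
example_set_mtime(struct example_inode_times *it,
    struct example_time new_mtime)
{
	assert(it != NULL);
	it->mtime = new_mtime;
	if (example_time_before(&new_mtime, &it->birthtime))
		it->birthtime = new_mtime;
}
```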


100201 16-Jul-2002 mckusick

Change the name of st_createtime to st_birthtime. This change is
made to reduce confusion between st_ctime and st_createtime.

Submitted by: Eric Allman <eric@sendmail.org>
Sponsored by: DARPA & NAI Labs.


99888 12-Jul-2002 trhodes

Fix a typo: s/your are/you are/


99590 08-Jul-2002 bde

Fixed some printf format errors (4 new ones reported by gcc and 5 nearby
old ones not reported by gcc). This helps unbreak LINT.


99220 01-Jul-2002 iedowse

Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by: mckusick
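
The hook mechanism described in this entry follows a common pattern: the
caller goes through a function pointer that is only non-NULL while the
optional code is initialized. The sketch below uses hypothetical
example_* names and an integer argument in place of the real kernel
types; it is not the actual soft updates interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hook pointer; NULL while the optional code is not initialized. */
static int (*example_fsync_hook)(int vnode_id);

/* The optional implementation that the hook will point at. */
static int
example_softdep_fsync(int vnode_id)
{
	return (vnode_id + 1);
}

static void
example_softdep_initialize(void)
{
	assert(example_fsync_hook == NULL);	/* catch double init */
	example_fsync_hook = example_softdep_fsync;
}

static void
example_softdep_uninitialize(void)
{
	example_fsync_hook = NULL;
}

/* Caller: no #ifdef, just a runtime check of the hook. */
static int
example_fsync(int vnode_id)
{
	if (example_fsync_hook != NULL)
		return (example_fsync_hook(vnode_id));
	return (0);
}
```

Because the caller has no compile-time dependency on the optional code,
the same caller works whether that code is compiled in, loaded later as
part of a module, or absent.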


99206 01-Jul-2002 iedowse

Add the ffs bits necessary to support unloading of the ufs kernel
module. This adds an ffs_uninit() function that calls ufs_uninit()
and also calls a new softdep_uninitialize() function. Add a stub
for softdep_uninitialize() to cover the non-SOFTUPDATES case.

Reviewed by: mckusick


99101 30-Jun-2002 iedowse

Remove the bogus SYSINIT from ufs_dirhash.c and instead add a call
to ufsdirhash_init() from ufs_init(). Add uninit() functions
corresponding to the ufs, dirhash, quota and ihash init() functions.


98888 26-Jun-2002 iedowse

Remove the kernel file-size limit for UFS2, so that only the limit
imposed by the filesystem structure itself remains. With 16k blocks,
the maximum file size is now just over 128TB.

For now, the UFS1 file size limit is left unchanged so as to remain
consistent with RELENG_4, but it too could be removed in the future.

Reviewed by: mckusick
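
The "just over 128TB" figure can be checked with a short calculation. The
sketch below assumes the usual inode layout of 12 direct block pointers
and 3 levels of indirection, with 8-byte UFS2 block pointers; the
function name is hypothetical and any other limits in the kernel are
ignored.

```c
#include <assert.h>
#include <stdint.h>

#define NDADDR	12	/* direct block pointers in an inode */
#define NIADDR	3	/* levels of indirect blocks */

/*
 * Largest file, in bytes, addressable through a UFS2 inode with the
 * given block size.  UFS2 block pointers are 64 bits (8 bytes) wide.
 */
static uint64_t
example_ufs2_max_file_size(uint64_t bsize)
{
	uint64_t nindir = bsize / 8;	/* pointers per indirect block */
	uint64_t blocks = NDADDR;
	uint64_t span = 1;
	int level;

	assert(bsize >= 8);
	for (level = 0; level < NIADDR; level++) {
		span *= nindir;		/* blocks reachable at this level */
		blocks += span;
	}
	return (blocks * bsize);
}
```

With 16K blocks each indirect block holds 2048 pointers, so the triple
indirect level alone covers 2048^3 blocks of 16K, which is exactly
128TB; the direct, single and double indirect blocks add the rest.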


98849 26-Jun-2002 ken

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c: Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c: Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object_allocate() does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structure.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


98788 25-Jun-2002 mckusick

Force the quota update to be done when an inode is released in
ufs_inactive. This avoids a panic when checking a NULL credential
in suser_cred().


98770 24-Jun-2002 jlemon

Prototype fixes (long newinum --> ino_t newinum).


98687 23-Jun-2002 mux

Warning fixes for 64-bit platforms. This eliminates all the
warnings I have had in the FFS code on sparc64.

Reviewed by: mckusick


98658 23-Jun-2002 dillon

Rename the BALLOC flags from B_* to BA_* to avoid confusion with the
struct buf B_ flags.

Approved by: mckusick


98640 22-Jun-2002 mckusick

This patch fixes a problem whereby filesystems that ran
out of inodes in a cylinder group would fail to check for
free inodes in other cylinder groups. This bug was introduced
in the UFS2 code merge two days ago.

An inode is allocated by calling ffs_valloc which calls
ffs_hashalloc to do the filesystem scan. Ffs_hashalloc
walks around the cylinder groups calling its passed allocator
(ffs_nodealloccg in this case) until the allocator returns a
non-zero result. The bug is that ffs_hashalloc expects the
passed allocator function to return a 64-bit ufs2_daddr_t.
When allocating inodes, it calls ffs_nodealloccg which was
returning a 32-bit ino_t. The ffs_hashalloc code checked
a 64-bit return value and usually found random non-zero bits in
the high 32-bits so decided that the allocation had succeeded
(in this case in the only cylinder group that it checked).
When the result was passed back to ffs_valloc it looked at
only the bottom 32-bits, saw zero and declared the system
out of inodes. But ffs_hashalloc had really only checked
one cylinder group.

The fix is to change ffs_nodealloccg to return 64-bit results.

Sponsored by: DARPA & NAI Labs.
Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk>
Reviewed by: Maxime Henrion <mux@freebsd.org>
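
The failure mode described in this entry can be modeled in ordinary C. What follows is only a simulation under stated assumptions, not the FFS code: a 64-bit "return register" with stale upper bits stands in for what the 64-bit caller read when the allocator produced only a 32-bit result. All three function names are hypothetical.

```c
#include <stdint.h>

/*
 * Simulation only: model a 64-bit return register whose upper half still
 * holds stale bits after a callee wrote just a 32-bit result into the
 * lower half. All names here are hypothetical, not from the FFS sources.
 */
static uint64_t
return_reg_after_32bit_result(uint64_t stale_reg, uint32_t result)
{
	return ((stale_reg & 0xffffffff00000000ULL) | result);
}

/* What the 64-bit caller concludes: any non-zero bit means "allocated". */
static int
hashalloc_saw_success(uint64_t reg)
{
	return (reg != 0);
}

/* What the final consumer sees: only the low 32 bits, the inode number. */
static uint32_t
valloc_saw_inode(uint64_t reg)
{
	return ((uint32_t)reg);
}
```

With stale upper bits and a failed allocation (result 0), the first check reports success while the second sees inode 0, which matches the "out of inodes after checking one cylinder group" symptom.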


98542 21-Jun-2002 mckusick

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


98425 19-Jun-2002 dillon

In rev 1.72 a situation related to write/mmap was fixed which could result
in a user process gaining visibility into the 'old' contents of a filesystem
block. There were two cases: (1) when uiomove() fails (user process issues
illegal write), and (2) when uiomove() overlaps a mmap() of the same file at
the same offset (fault -> recursive buffer I/O reads contents of old block).

Unfortunately 1.72 also had the unintended effect of forcing the filesystem
to do a read-before-write in the case of a full-block-write (non append case),
e.g. 'dd if=/dev/zero of=test.dat bs=1m count=256 conv=notrunc'. This
destroys performance: not only is a read forced for every write, but
clustering breaks as well.

The solution is to clear the buffer manually in the full-block case rather
than asking BALLOC to do it (BALLOC issues the read-before-write). In the
partial-block case we want BALLOC to do it because the read-before-write
is necessary. This patch should greatly improve database and news-feed
server performance.

Found by: MKI <mki@mozone.net>
MFC after: 3 days


97962 06-Jun-2002 semenu

Fix a typo in my recently added comment: s/beleived/believed/

Submitted by: keramida


97724 01-Jun-2002 alfred

Backout/modify previous revision:
"empty default cases shouldn't be removed, they should have a break;
statement added to them."

Requested by: billf


97723 01-Jun-2002 alfred

Silence warnings, remove some empty 'default' switch cases.


97640 30-May-2002 semenu

Remove lock from ffs_vget introduced by v1.24. Instead of locking the
vnode creation globally, we allow processes to create vnodes concurrently.
In case of concurrent creation of a vnode for the same ino, we allow processes
to race and then check who wins.

Assuming that concurrent creation of a vnode for the same ino is a really rare
case, this is believed to be an improvement, as it just allows concurrent
creation of vnodes.

Idea by: bp
Reviewed by: dillon
MFC after: 1 month


96885 19-May-2002 rwatson

Remove IFS from 5.0-CURRENT. This facilitates introducing UFS2 as
IFS had its fingers deep in the belly of the UFS/FFS split. IFS
will be reimplemented by the maintainer at a later date.

Requested by: adrian (maintainer)


96876 18-May-2002 iedowse

Fix two casts to "daddr_t *" that should have been "ufs_daddr_t *".


96874 18-May-2002 iedowse

Fix a typo where sizeof(daddr_t) was specified instead of sizeof(doff_t).
Now that daddr_t is 64-bit, this caused hash blocks to be allocated
twice as large as they need to be.
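
A minimal sketch of why this typo doubled the allocation. The typedefs are stand-ins for the kernel types, and their widths are assumptions: the block address is 64-bit (per the r96572 entry in this log) and the directory offset is taken to be 32-bit. Both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel typedefs; the widths are assumptions. */
typedef int64_t toy_daddr_t;	/* disk block address, 64-bit */
typedef int32_t toy_doff_t;	/* directory offset, assumed 32-bit */

/* Bytes needed for nslots hash slots when each slot holds one offset. */
static size_t
hash_bytes_intended(size_t nslots)
{
	return (nslots * sizeof(toy_doff_t));
}

/* The typo: sizing by the wrong type silently doubles the allocation. */
static size_t
hash_bytes_with_typo(size_t nslots)
{
	return (nslots * sizeof(toy_daddr_t));
}
```

Because the code still works with the oversized buffer, a mistake of this kind shows up only as wasted memory, not as a crash.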


96873 18-May-2002 iedowse

Remove um_i_effnlink_valid, i_spare[] and the ufsmount_u and inode_u
unions, since these were only necessary when ext2fs used ufs code.

Reviewed by: mckusick


96821 17-May-2002 phk

Fix ufs_daddr_t/daddr_t type problems.

Sponsored by: DARPA & NAI labs.


96820 17-May-2002 phk

Call ufs_bmaparray() with right parameter type.

Sponsored by: DARPA & NAI Labs.


96755 16-May-2002 trhodes

More s/file system/filesystem/g


96572 14-May-2002 phk

Make daddr_t and u_daddr_t 64bits wide.
Retire daddr64_t and use daddr_t instead.

Sponsored by: DARPA & NAI Labs.


96506 13-May-2002 phk

Remove register keyword.

Sponsored by: DARPA & NAI Labs.
Submitted by: mckusick


96482 12-May-2002 phk

Remove two "register" and a blank line.

Submitted by: mckusick
Sponsored by: DARPA & NAI Labs.


96473 12-May-2002 phk

ARGH! SBLOCK is not unused. Try to get this right.

BBSIZE belongs in <sys/disklabel.h> (but shouldn't be a constant).

Define SBLOCK again, using the right math.

Sponsored by: DARPA & NAI Labs.


96472 12-May-2002 phk

Remove #define for BBOFF, it is assumed == 0 so many places that we might
as well forget about it. In fact the only thing which used it was the
SBOFF macro.

Sponsored by: DARPA & NAI Labs.


96471 12-May-2002 phk

Remove unused BBLOCK and SBLOCK #defines.

Sponsored by: DARPA & NAI Labs.


96095 06-May-2002 alc

o Condition the compilation and use of vm_freeze_copyopts()
on ENABLE_VFS_IOOPT.


96072 05-May-2002 phk

Move some UFS related stuff home where it belongs.


96010 04-May-2002 jeff

Include systm.h so panic(9) is defined when doing DEBUG_ALL_VFS_LOCKS.


95974 03-May-2002 phk

Name ufs_vop_[gs]etextattr() consistently with the rest of our VOPs and
put them in the ufs_vnops where they belong, rather than in the ffs_vnops.

Ok'ed by: rwatson
Sponsored by: DARPA & NAI Labs.


95945 02-May-2002 phk

Use vop_panic() instead of our home-rolled version.


94996 18-Apr-2002 alfred

Remove support for using soon to be retired "special" poll(2) ops.
Replace with kevent(2) ops.

This is untested, but the code would rot even further if this wasn't
applied. I've chosen to apply this to prompt some cleanup.

Submitted by: bde


94723 15-Apr-2002 jeff

Don't peek into the malloc_type structure for limits. The desired vnodes
check should be sufficient. This is required for the pending removal of
malloc_type limits.


94182 08-Apr-2002 phk

Move generic disk ioctls from <sys/disklabel.h> to <sys/disk.h>.

Sponsored by: DARPA & NAI Labs


93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


93736 03-Apr-2002 phk

Move the FFS parameter MAXFRAG from <sys/param.h> to <ufs/ffs/fs.h>

Sponsored by: DARPA & NAI Labs.


93654 02-Apr-2002 phk

Use DIOCGSECTORSIZE instead of the bogus DIOCGPART ioctl.


93593 01-Apr-2002 jhb

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


93430 30-Mar-2002 bde

In ffs_mountffs(), set mnt_iosize_max to si_iosize_max unconditionally
provided the latter is nonzero. At this point, the former is a fairly
arbitrary default value (DFLTPHYS), so changing it to any reasonable
value specified by the device driver is safe. Using the maximum of
these limits broke ffs clustered i/o for devices whose si_iosize_max
is < DFLTPHYS. Using the minimum would break device drivers' ability
to increase the active limit from DFLTPHYS up to MAXPHYS.

Copied the code for this and the associated (unnecessary?) fixup of
mp_iosize_max to all other filesystems that use clustering (ext2fs and
msdosfs). It was completely missing.

PR: 36309
MFC-after: 1 week
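
The selection policy described in this entry can be sketched as a small helper. This is an illustration under assumptions, not the code from ffs_mountffs(): the function name and parameters are hypothetical, and treating the "associated fixup" as a clamp to a system-wide ceiling is a guess.

```c
/*
 * Hypothetical helper mirroring the described policy. cur_max is the
 * mount's default limit, dev_max the driver's limit (0 means "not set"),
 * hard_max the system-wide ceiling.
 */
static unsigned long
toy_pick_iosize_max(unsigned long cur_max, unsigned long dev_max,
    unsigned long hard_max)
{
	if (dev_max != 0)
		cur_max = dev_max;	/* trust the driver, larger or smaller */
	if (cur_max > hard_max)
		cur_max = hard_max;	/* assumed follow-up clamp */
	return (cur_max);
}
```

Taking the driver's value unconditionally avoids both problems named above: a maximum would ignore drivers with a smaller limit, and a minimum would prevent drivers from raising the default.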


92807 20-Mar-2002 dwmalone

Two minor changes to dirhash, which result in some marginal benchmark
improvements.

1) If deleting an entry results in a chain of deleted slots ending in an
empty slot, then we can be a bit more aggressive about marking slots as
empty.

2) The last stage of the FNV hash is to xor the last byte of data
into the hash. This means that filenames which differ only in
the last byte will be placed close to one another in the hash
table, which forms longer chains. To work around this common
case, we also hash in the address of the dirhash structure.

news/cancel = news/articles/control/cancel for a tradspool inn server
squid2 = squid level 2 directory (dirs called 00->FF)
squid3 = squid level 3 directory (files called 00001F00->00001FFF)

mean #probes for
                  home dir  mh inbox  news/cancel   tmp  squid2  squid3
old successful        1.02      3.19         4.07  1.10    7.85    2.06
new successful        1.04      1.32         1.27  1.04    1.93    1.17

old unsuccessful      1.08      4.50         5.37  1.17   10.76    2.69
new unsuccessful      1.08      1.73         1.64  1.17    2.89    1.37

Reviewed by: iedowse
MFC after: 2 weeks
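
Point 2 of this entry can be illustrated with a generic 32-bit FNV-1 hash. This is a sketch, not the dirhash code: the offset basis and prime are the standard FNV-1 constants (the kernel's initial value may differ), salted_name_hash() is a hypothetical name, and a plain integer stands in for the address of the dirhash structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic 32-bit FNV-1: multiply by the prime, then xor in the next byte. */
static uint32_t
fnv1_32(const void *buf, size_t len, uint32_t hash)
{
	const unsigned char *p = buf;

	while (len-- > 0) {
		hash *= 16777619U;	/* FNV 32-bit prime */
		hash ^= *p++;
	}
	return (hash);
}

/*
 * Hypothetical salted variant: after hashing the name, continue the hash
 * over a per-directory value so the name's final byte is no longer the
 * last thing mixed in.
 */
static uint32_t
salted_name_hash(const char *name, size_t len, uintptr_t salt)
{
	uint32_t hash = fnv1_32(name, len, 2166136261U);

	return (fnv1_32(&salt, sizeof(salt), hash));
}
```

Since the final step is a plain xor of one byte, two names that differ only in their last byte produce unsalted hashes that differ only in the low 8 bits, which is why such names cluster in the table. Continuing the hash over further bytes pushes that difference through additional multiply steps.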


92768 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.


92728 19-Mar-2002 alfred

Remove __P.


92640 19-Mar-2002 bde

Fixed some printf format errors (hopefully all of the remaining daddr64_t
ones for GENERIC, and all others on the same line as those). Reformat
the printfs if necessary to avoid new long lines or old format printf
errors.


92462 17-Mar-2002 mckusick

Add a flags parameter to VFS_VGET to pass through the desired
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems, but
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.


92363 15-Mar-2002 mckusick

Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.


92299 15-Mar-2002 obrien

Quiet a warning on the Alpha.


92250 14-Mar-2002 mckusick

This corrects the first of two known deadlock conditions that
come from the presence of a snapshot file.


92098 11-Mar-2002 iedowse

Fix a bug in ufsdirhash_adjfree() that caused it to incorrectly
update the free-space statistics in some cases. The problem affected
directory blocks when the free space dropped below the size of the
maximum allowed entry size. When this happened, the free-space
summary information could claim that there are no further blocks
that can fit a maximum-size entry, even if there are.

The effect of this bug is that the directory may be enlarged even
though there is space within the directory for the new entry. This
wastes disk space and has a negative impact on performance.

Fix it by correctly computing the dh_firstfree array index, adding
a helper macro for clarity. Put an extra sanity check into
ufsdirhash_checkblock() to detect the situation in future.

Found by: dwmalone
Reviewed by: dwmalone
MFC after: 1 week


92095 11-Mar-2002 phk

I missed one VOP_CLOSE in the previous commit.

Pointed out by: bde


92092 11-Mar-2002 phk

As a XXX bandaid open the mounted device READ/WRITE even if we only mount
read-only.

The trouble here is that we don't reopen the device in read/write mode
when we remount in read/write mode resulting in a filesystem sending
write requests to a device which was only opened read/only.

I'm not quite sure how such a reopen would best be done and defer
the problem to more agile hackers.


91825 07-Mar-2002 rwatson

Update DBA for NAI. We have several. We used the wrong one. :-)


91814 07-Mar-2002 green

Add new errno ``ENOATTR''.


91720 06-Mar-2002 dillon

cleanup readability syntax prior to ongoing b_resid work commits.

MFC after: 1 day


91420 27-Feb-2002 jhb

Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic
and isn't strictly required. However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.


91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


91060 22-Feb-2002 phk

Replace bowrite() with BUF_WRITE in ufs.

Remove bowrite(), it is now unused.

This is the first step in getting entirely rid of BIO_ORDERED which is
a generally accepted evil thing.

Approved by: mckusick


90972 20-Feb-2002 rwatson

o Minor style fix on #endif, missing '_' in comment.


90860 18-Feb-2002 phk

Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>


90791 17-Feb-2002 phk

Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.


90790 17-Feb-2002 phk

Collect the VN_KNOTE() macro definitions on vnode.h


90538 11-Feb-2002 julian

In a threaded world, different priorities become properties of
different entities. Make it so.

Reviewed by: jhb@freebsd.org (john baldwin)


90453 10-Feb-2002 rwatson

Minor style tweaks.

Remove an unneeded comment and commented out code that won't be
needed.


90452 10-Feb-2002 rwatson

Copyright + license update.


90448 10-Feb-2002 rwatson

Part I: Update extended attribute API and ABI:

o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
as not to use the scatter gather API (which appeared not to be used
by any consumers, and be less portable), rather, accepts 'data'
and 'nbytes' in the style of other simple read/write interfaces.
This changes the API and ABI.

o Modify system call semantics so that extattr_get_{fd,file}() return
a size_t. When performing a read, the number of bytes read will
be returned, unless the data pointer is NULL, in which case the
number of bytes of data are returned. This changes the API only.

o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
argument so as to return the size, if desirable. If set to NULL,
the size will not be returned.

o Update various filesystems (pseudofs, ufs) to DTRT.

These changes should make extended attributes more useful and more
portable. More commits to rebuild the system call files, as well
as update userland utilities to follow.
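
The read semantics described in the second bullet can be modeled with a toy function. This is a sketch of the stated behavior only, not the system call or the VOP: toy_extattr_get() and its parameters are hypothetical, and copying at most nbytes when the buffer is smaller than the attribute is an assumption.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy model of the described read semantics; not the real system call.
 * attr/attrlen describe the stored attribute value, data/nbytes the
 * caller's buffer.
 */
static size_t
toy_extattr_get(const void *attr, size_t attrlen, void *data, size_t nbytes)
{
	size_t n;

	if (data == NULL)
		return (attrlen);	/* size query: nothing is copied */
	n = attrlen < nbytes ? attrlen : nbytes;
	memcpy(data, attr, n);
	return (n);			/* number of bytes actually read */
}
```

A caller can therefore pass a NULL data pointer first to learn the attribute size, allocate a buffer, and then call again to read the value.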

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


90438 10-Feb-2002 phk

Remove di_inumber since LFS is long gone.


90366 07-Feb-2002 mckusick

Occasionally background fsck would cause a spurious ``freeing free
inode'' panic. This change corrects that problem by setting the
fs_active flag when the inode map changes to notify the snapshot
code that the cylinder group must be rescanned.

Submitted by: Robert Watson <rwatson@FreeBSD.org>


90329 07-Feb-2002 mckusick

Occasionally deleted files would hang around for hours or days
without being reclaimed. This bug was introduced in revision 1.95
dealing with filenames placed in newly allocated directory blocks,
thus is not present in 4.X systems. The bug is triggered when a
new entry is made in a directory after the data block containing
the original new entry has been written, but before the inode
that references the data block has been written.

Submitted by: Bill Fenner <fenner@research.att.com>


90098 02-Feb-2002 mckusick

When taking a snapshot, we must check for active files that have
been unlinked (i.e., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.


89680 23-Jan-2002 mckusick

Add a stub for softdep_request_cleanup() so that compilation without
SOFTUPDATES option works properly.

Submitted by: Benno Rice <benno@jeamland.net>


89637 22-Jan-2002 mckusick

This patch fixes a long standing complaint with soft updates in
which small and/or nearly full filesystems would fail with `file
system full' messages when trying to replace a number of existing
files (for example during a system installation). When the allocation
routines are about to fail with a file system full condition, they
make a call to softdep_request_cleanup() which attempts to accelerate
the flushing of pending deletion requests in an effort to free up
space. In the face of filesystem I/O requests that exceed the
available disk transfer capacity, the cleanup request could take
an unbounded amount of time. Thus, the softdep_request_cleanup()
routine will only try for tickdelay seconds (default 2 seconds)
before giving up and returning a filesystem full error. Under typical
conditions, the softdep_request_cleanup() routine is able to free
up space in under fifty milliseconds.
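
The bounded wait described in this entry follows a common pattern: retry for a fixed budget, then report failure. A minimal sketch, assuming a hypothetical try_flush() callback in place of the real flushing work and an iteration count in place of the tickdelay timer:

```c
/*
 * Hypothetical bounded-retry loop. try_flush() stands in for one round of
 * flushing pending deletions and returns non-zero once space is available;
 * max_ticks stands in for the tickdelay budget.
 */
static int
toy_request_cleanup(int (*try_flush)(void *), void *arg, int max_ticks)
{
	int tick;

	for (tick = 0; tick < max_ticks; tick++) {
		if (try_flush(arg))
			return (1);	/* space was freed in time */
	}
	return (0);			/* give up: caller reports "full" */
}

/* Example callback: succeeds once the countdown behind arg reaches zero. */
static int
toy_countdown_flush(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0)
		(*remaining)--;
	return (*remaining == 0);
}
```

The budget keeps a single allocation from stalling indefinitely when the disk cannot keep up, at the cost of occasionally returning a "full" error that a longer wait would have avoided.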


89450 17-Jan-2002 mckusick

Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26
which caused incomplete snapshots to be taken. When background
fsck would run on these snapshots, the result would be files
being incorrectly released which would subsequently panic the
kernel with ``handle_workitem_freefile: inodedep survived'',
``handle_written_inodeblock: live inodedep'', and
``handle_workitem_remove: lost inodedep'' errors.


89413 16-Jan-2002 mckusick

Put write on read-only filesystem panic after we have weeded out
block and character devices, fifo's, etc.

Submitted by: Bruce Evans <bde@zeta.org.au>


89384 15-Jan-2002 mckusick

When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck. With this change, such files are found at the time of the
down-grade. Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by: "Sam Leffler" <sam@errno.com>


89306 13-Jan-2002 alfred

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to be locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


89295 12-Jan-2002 mckusick

When going to sleep, we must save our SPL so that it does not get
lost if some other process uses the lock while we are sleeping. We
restore it after we have slept. This functionality is provided by
a new routine interlocked_sleep() that wraps the interlocking with
functions that sleep. This function is then used in place of the
old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros.

Submitted by: Debbie Chu <dchu@juniper.net>


89270 11-Jan-2002 mckusick

Must call drain_output() before checking the dirty block list
in softdep_sync_metadata(). Otherwise we may miss dependencies
that need to be flushed which will result in a later panic
with the message ``vinvalbuf: dirty bufs''.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
MFC after: 1 week


89213 10-Jan-2002 phk

Do not pull quota entries off the cache-list if they have already
been removed from the cache-list as part of a previous unmount.

This would result in panics (page fault in dqflush()) during subsequent
umounts provided that enough distinct UID's to actually make the
hash do something are active.

This can probably explain a number of weird quota related behaviours.

PR: 32331 maybe more.
Reproduced by: Søren Schrørder <sch@cybercity.dk>


89089 08-Jan-2002 msmith

Initialise the bioops vector hack at runtime rather than at link time. This
avoids the use of common variables.

Reviewed by: mckusick


88318 20-Dec-2001 dillon

Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release


88138 18-Dec-2001 mckusick

Change the atomic_set_char to atomic_set_int and atomic_clear_char
to atomic_clear_int to ease the implementation for the sparc64.

Requested by: Jake Burkholder <jake@locore.ca>


88026 16-Dec-2001 iedowse

Make sure we ignore the value of `fs_active' when reloading the
superblock, and move the initialisation of it to beside where other
pointer fields are initialised.


88025 16-Dec-2001 iedowse

Move the new superblock field `fs_active' into the region of the
superblock that is already set up to handle pointer types. This
fixes an accidental change in the superblock size on 64-bit platforms
caused by revision 1.24.


87827 14-Dec-2001 mckusick

Minimize the time necessary to suspend operations on a filesystem
when taking a snapshot. The two time consuming operations are
scanning all the filesystem bitmaps to determine which blocks
are in use and scanning all the other snapshots so as to be able
to expunge their blocks from the view of the current snapshot.
The bitmap scanning is broken into two passes. Before suspending
the filesystem all bitmaps are scanned. After the suspension,
those bitmaps that changed after being scanned the first time
are rescanned. Typically there are few bitmaps that need to be
rescanned. The expunging of other snapshots is now done after
the suspension is released by observing that we can easily
identify any blocks that were allocated to them after the
suspension (they will be marked as `not needing to be copied'
in the just created snapshot). For all the gory details, see
the ``Running fsck in the Background'' paper in the Usenix
BSDCon 2002 Conference Proceedings, pages 55-64.


87782 13-Dec-2001 mckusick

When a file is partially truncated, we first check to see if the
new file end will land in the middle of a file hole. Since the last
block of a file must always be allocated, the hole is filled by
allocating a block at that location. If the hole being filled is
a direct block, then the truncation may eventually reduce the
full sized block down to a fragment. When running with soft
updates, it is necessary to FSYNC the file after allocating the
block and before creating the fragment to avoid triggering a
soft updates inconsistency when the block unexpectedly shrinks.

Found by: Matthew Dillon <dillon@apollo.backplane.com>
MFC after: 1 week


87133 30-Nov-2001 rwatson

Use 'mkdir -p /.attribute/system' instead of breaking it into
two separate mkdir targets.

Submitted by: jedgar


87132 30-Nov-2001 rwatson

Use 'mkdir -p /.attribute/system' instead of breaking it into
two separate mkdir targets.


87131 30-Nov-2001 rwatson

README.extattr incorrectly specified sample command lines for
UFS_EXTATTR_AUTOSTART. Insert the missing 'initattr' arguments
to extattrctl.

Noticed by: green


86782 22-Nov-2001 guido

When mkdir()-ing, the parent dir gets its link count increased.
Fix VN_KNOTE to reflect that.

Found by: tobez@freebsd.org
MFC after: 2 days


86350 14-Nov-2001 iedowse

Oops, when trying the dirhash sequential-access optimisation,
compare the slot offset against the predicted offset, not a boolean
flag. This typo effectively disabled the sequential optimisation,
but was otherwise harmless.

Not surprisingly, fixing this improves performance in the sequential
access case. I am seeing a 7% speedup on one machine here; using
dirhash when sequentially looking up directory entries is now about
5% faster instead of 2% slower than the non-dirhash case.

Submitted by: KOIE Hidetaka <koie@suri.co.jp>
MFC after: 1 week


86089 05-Nov-2001 dillon

Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write. This flag needs to be used in devices whose
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use. The syncer or buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after: 1 week


85845 01-Nov-2001 rwatson

o Update copyright dates.
o Add reference to TrustedBSD Project in license header.
o Update dated comments, including comment in extattr.h claiming that
no file systems support extended attributes.
o Improve comment consistency.


85581 27-Oct-2001 rwatson

o Although this is not specified in POSIX.1e, the UFS ACL implementation
coerces the deletion of a default ACL on a directory when no default
ACL EA is present to success. Because the UFS EA implementation doesn't
distinguish the EA failure modes "that EA name has not been
administratively enabled" from "that EA name has no defined data",
there's a potential conflict in error return values. Normally, the
lack of administratively configured EA support is coerced to
EOPNOTSUPP to indicate that ACLs are not available; in this case,
it is possible to get a successful return, even if ACLs are not
available because EA support for them has not been enabled.

Expand the comment in ufs_setacl() to identify this case.

Obtained from: TrustedBSD Project


85580 27-Oct-2001 rwatson

o Clarify a comment about the locking condition of the vnode upon exit
from ufs_extattr_enable_with_open().
o Print auto-start notifications if (bootverbose). This was previously
commented out since it didn't know how to check for bootverbose.
o Drop in comments throughout indicating where ENOENT should be replaced
with ENOATTR once that is available.

Obtained from: TrustedBSD Project


85579 27-Oct-2001 rwatson

o The comment about ordering the destruction of the lock and the removal of
the flag indicating that the structure was initialized didn't need
an XXX, since it didn't need fixing.

Obtained from: TrustedBSD Project


85578 27-Oct-2001 rwatson

o Wrap a number of long lines of code, many of which were introduced
due to KSE-related (p) expansions.

Obtained from: TrustedBSD Project


85577 27-Oct-2001 rwatson

Since namespace support was added to the UFS extended attribute
implementation to replace single-character namespace prefixes, '$' is no
longer an invalid attribute name, and the namespace is relevant to
validity determination.

o Remove '$' case from ufs_extattr_valid_attrname()
o Add attrnamespace argument to ufs_extattr_valid_attrname(), and
fill out appropriately.

Currently no decisions are made based on the namespace argument, but
may be in the future.

Obtained from: TrustedBSD Project


85517 26-Oct-2001 dillon

Implement kern.maxvnodes. Adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


85512 25-Oct-2001 iedowse

Default to not performing ufs_dirhash's extensive directory-block
sanity check after every directory modification. This check can be
re-enabled at any time by setting the sysctl "vfs.ufs.dirhash_docheck"
to 1.

This group of sanity tests was there to ensure that any UFS_DIRHASH
bugs could be caught by a panic before a potentially corrupted
directory block would be written to disk. It has served its main
purpose now, so disable it in the interest of performance.

MFC after: 1 week


85339 23-Oct-2001 dillon

Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after: 3 days


84827 11-Oct-2001 jhb

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.
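
The three primitives described above can be modeled with a toy reference-counted structure. This is a sketch of the described contract only, not the kernel's struct ucred: every name is hypothetical, locking is omitted, and leaving the destination's reference count untouched in the copy is an assumption.

```c
/* Toy credential with a reference count; not the kernel's struct ucred. */
struct toy_cred {
	unsigned int ref;	/* number of holders */
	unsigned int uid;
};

/* Like the described crhold(): bump the count and hand the pointer back. */
static struct toy_cred *
toy_crhold(struct toy_cred *cr)
{
	cr->ref++;
	return (cr);
}

/* Like the described crshared(): true if more than one holder exists. */
static int
toy_crshared(const struct toy_cred *cr)
{
	return (cr->ref > 1);
}

/* Like the described crcopy(): copy the credential data, return nothing. */
static void
toy_crcopy(struct toy_cred *dest, const struct toy_cred *src)
{
	dest->uid = src->uid;	/* assumption: dest keeps its own count */
}
```

Returning the pointer from the hold function lets a caller take a reference and assign it in one expression, for example `newcr = toy_crhold(oldcr);`.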


84811 11-Oct-2001 jhb

Add missing includes of sys/lock.h.


84642 08-Oct-2001 dillon

Remove panics for rename() race conditions. The panics are inappropriate
because the IN_RENAME flag only fixes a few of the huge number of race
conditions that can result in the source path becoming invalid even
prior to the VOP_RENAME() call. The panics created a serious security
issue whereby an attacker could fairly easily cause the panic to
occur, crashing the machine.

The correct solution requires a great deal of work in the namei
path cache code.

MFC after: 0 days


84374 02-Oct-2001 rwatson

o Replace two direct uid!=0 comparisons with suser_xxx() calls.

Obtained from: TrustedBSD Project


84373 02-Oct-2001 rwatson

o Replace two direct uid!=0 comparisons with suser_td() calls.

Obtained from: TrustedBSD Project


84344 02-Oct-2001 dillon

Back out the last commit. The problem is actually much worse than I
first thought and may require serious work to the VOP_RENAME() api itself.
Basically, by the time the VOP_RENAME() function is called, it's already
too late.


84339 02-Oct-2001 dillon

IN_RENAME should only be cleared by the routine that set it. This fixes
a rename/rmdir race that has been shown to cause a panic.

Bug reported by: Yevgeniy Aleynikov <eugenea@infospace.com>
MFC after: 3 days


84050 27-Sep-2001 jhb

- Fix some minor whitespace nits.
- Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a
little nit that caused it to be defined as -(sizeof (struct thread) + 1)
instead of -2.


83992 26-Sep-2001 rwatson

o Re-enable support of system file flags in jail() by adding back the
PRISON_ROOT to the suser_xxx() check. Since securelevels may now
be raised in specific jails, use of system flags can still be
restricted in jail(), but in a more configurable way.
o Users of jail() expecting system flags (such as schg) to restrict
jail()'s should be sure to set the securelevel appropriately in
jail()'s.
o This fixes activities involving automated system flag removal in
jail(), including installkernel and friends.

Obtained from: TrustedBSD Project


83987 26-Sep-2001 rwatson

o Modify ufs_setattr() so that it uses securelevel_gt() instead of
direct variable access.

Obtained from: TrustedBSD Project


83924 25-Sep-2001 rwatson

o Further clarify comment: at Udo's request, re-insert the 'if'
referring to securelevels; also, update the unprivileged process text
to better indicate the scope of actions permittable when any system
flags are already set (limited).

Submitted by: Udo Schweigert <udo.schweigert@siemens.com>


83918 25-Sep-2001 rwatson

o Parallelize the comment on the relationship between privileged un-jailed
processes and the actual securelevel check: make the comment use '> 0'
instead of inverted '<= 0'.


83899 24-Sep-2001 iedowse

The addition of i_dirhash to struct inode pushed RELENG_4's
sizeof(struct inode) into a new malloc bucket on the i386. This
didn't happen in -current due to the removal of i_lock, but it does
no harm to apply the workaround to -current first.

Reduce the size of the i_spare[] array in struct inode from 4 to
3 entries, and change ext2fs to use i_din.di_spare[1] so that it
does not need i_spare[3].
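
To see why a few bytes can matter, here is a minimal sketch assuming an allocator that rounds each request up to a power-of-two bucket. The 16-byte minimum and the sizes in the usage note are illustrative assumptions, not the actual sizeof(struct inode) on any release.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a request up to the next power-of-two bucket, the way a simple
 * bucket allocator might. The 16-byte minimum is an assumption.
 */
static size_t
bucket_size(size_t request)
{
	size_t bucket = 16;

	assert(request > 0);
	while (bucket < request)
		bucket <<= 1;
	return (bucket);
}
```

With these assumptions a 260-byte structure lands in a 512-byte bucket, while trimming it to 256 bytes keeps it in the 256-byte bucket, halving the memory consumed per allocation.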

Reviewed by: bde
MFC after: 3 days


83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


83263 09-Sep-2001 iedowse

The "dirpref" directory layout preference improvements make use of
an array "fs_contigdirs[]" to avoid too many directories getting
created in each cylinder group. The memory required for this and
two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with
a single malloc() call, and divided up afterwards. However, the
'space' pointer is not advanced correctly, so fs_contigdirs and
fs_maxcluster end up pointing to the same address.

Add the missing code to advance the 'space' pointer, and remove
an unnecessary update of the pointer that follows.

This is likely to fix the "ffs_clusteralloc: map mismatch" panics
that have been reported recently.
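
The class of bug described above can be illustrated with a userland sketch that carves three arrays out of one allocation. The struct carved type, the element types, and the ordering of the arrays are assumptions made for illustration and are not copied from the FFS mount code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical trio of arrays carved out of one allocation. */
struct carved {
	void	*base;		/* start of the single allocation, for free() */
	int32_t	*csp;		/* first array: n_csp entries */
	int32_t	*maxcluster;	/* second array: ncg entries */
	uint8_t	*contigdirs;	/* third array: ncg entries */
};

/*
 * Allocate all three arrays with one calloc() and split the block up.
 * The pointer must be advanced past each array before the next one is
 * assigned; skipping an advance makes two arrays alias each other.
 */
static int
carve(struct carved *c, size_t n_csp, size_t ncg)
{
	size_t size;
	uint8_t *space;

	assert(n_csp > 0 && ncg > 0);
	size = n_csp * sizeof(int32_t) + ncg * sizeof(int32_t) +
	    ncg * sizeof(uint8_t);
	space = calloc(1, size);
	if (space == NULL)
		return (-1);
	c->base = space;
	c->csp = (int32_t *)(void *)space;
	space += n_csp * sizeof(int32_t);
	c->maxcluster = (int32_t *)(void *)space;
	space += ncg * sizeof(int32_t);	/* the kind of advance that was missing */
	c->contigdirs = space;
	return (0);
}
```

Without the second advance, writes through contigdirs would silently overwrite the start of maxcluster, which is consistent with the map-mismatch symptom described in the entry.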

Submitted by: Luke Mewburn <lukem@wasabisystems.com>


82770 01-Sep-2001 jedgar

Use ACL_PERM_NONE instead of hardcoding 0 when initializing
ACL entry permissions.

Reviewed by: rwatson


82755 01-Sep-2001 rwatson

o At some point, unmounting a non-EA file system with EA's compiled
in got a bit broken: when ufs_extattr_stop() was called and failed,
ufs_extattr_destroy() would panic. This makes the call to destroy()
conditional on the success of stop().

Submitted by: Christian Carstensen <cc@devcon.net>
Obtained from: TrustedBSD Project


82395 27-Aug-2001 peter

If a file has been completely unlinked, stop automatically syncing the
file. ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them. This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.


82364 26-Aug-2001 iedowse

Stop using dirhash when a directory is removed, and ensure that we
never attempt to hash directories once they are deleted. This fixes
a problem where operations on a deleted directory could trigger
dirhash sanity panics.


82334 26-Aug-2001 iedowse

When compacting directories, ufs_direnter() always trusted DIRSIZ()
to supply the number of bytes to be bcopy()'d to move an entry. If
d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible
length, so ufs_direnter could end up corrupting a directory during
compaction. In practice I believe this can only happen after fsck_ffs
has fixed a previously-corrupted directory.

We now deal with any mid-block unused entries specially to avoid
using DIRSIZ() or bcopy() on such entries. We also ensure that the
variables 'dsize' and 'spacefree' contain meaningful values at all
times. Add a few comments to describe better this intricate piece
of code.

The special handling of mid-block unused entries makes the
dirhash-specific bugfix in the previous revision (1.53) now unnecessary,
so this change removes it.

Reviewed by: mckusick


82124 22-Aug-2001 iedowse

When compressing directory blocks, the dirhash code didn't check
that the directory entry was in use before attempting to find it
in the hash structures to change its offset. Normally, unused
entries do not need to be moved, but fsck can leave behind some
unused entries that do. A dirhash sanity panic resulted when the
entry to be moved was not found. Add a check that stops entries
with d_ino == 0 from being passed to ufsdirhash_move().


81877 18-Aug-2001 peter

Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS'
without 'options FFS' would fail to link.


80554 29-Jul-2001 iedowse

Two recent commits in sys/ufs/ufs interacted badly with ext2fs
because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink
is invalid because ext2fs does not maintain this field. In ufs_close(),
i_effnlink is also tested, to determine whether or not to call
vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of
ext2fs filesystems; I believe the other is harmless.

Fix both cases by checking um_i_effnlink_valid in the ufsmount
struct, and use i_nlink if necessary.
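
A minimal sketch of the selection described above, using cut-down hypothetical structures (the *_stub types and the helper name are not the kernel's; only the field names come from the entry):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down versions of the structures named in the entry. */
struct ufsmount_stub {
	bool um_i_effnlink_valid;	/* does this filesystem maintain i_effnlink? */
};

struct inode_stub {
	int i_nlink;		/* on-disk link count */
	int i_effnlink;		/* effective link count */
};

/* Pick the link count that is meaningful for this filesystem. */
static int
effective_nlink(const struct ufsmount_stub *ump, const struct inode_stub *ip)
{
	assert(ump != NULL && ip != NULL);
	return (ump->um_i_effnlink_valid ? ip->i_effnlink : ip->i_nlink);
}
```

A filesystem that never updates i_effnlink would otherwise always appear to have zero effective links, which is how a test on that field could break file handle lookups.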

Noticed by: bde
Reviewed by: mckusick, bde


80456 27-Jul-2001 iedowse

Disable the dirhash sanity check that panics if an unused directory
entry (d_ino == 0) is found in a position that is not the start of
a DIRBLKSIZ block.

While such entries cannot occur normally (ufs always extends the
previous entry to cover the free space instead), they do not cause
problems and fsck does not fix them, so panicking is bad.


79769 16-Jul-2001 peter

Use a fixed type for times in on-disk structures for ufs rather than
something that could potentially change like time_t.


79690 13-Jul-2001 iedowse

Return a locked struct buf from ufsdirhash_lookup() to avoid one
extra getblk/brelse sequence for each lookup. We already had this
buf in ufsdirhash_lookup(), so there was no point in brelse'ing it
only to have the caller immediately reacquire the same buffer.

This should make the case of sequential lookups marginally faster;
in my tests, sequential lookups with dirhash enabled are now only
around 1% slower than without dirhash.


79561 10-Jul-2001 iedowse

Bring in dirhash, a simple hash-based lookup optimisation for large
directories. When enabled via "options UFS_DIRHASH", in-core hash
arrays are maintained for large directories. These allow all
directory operations to take place quickly instead of requiring
long linear searches. For now anyway, dirhash is not enabled by
default.

The in-core hash arrays have a memory requirement that is approximately
half the size of the on-disk directory file. A number
of new sysctl variables allow control over which directories get
hashed and over the maximum amount of memory that dirhash will use:

vfs.ufs.dirhash_minsize
The minimum on-disk directory size for which hashing should be
used. The default is 2560 (2.5k).

vfs.ufs.dirhash_maxmem
The system-wide maximum total memory to be used by dirhash data
structures. The default is 2097152 (2MB).

The current amount of memory being used by dirhash is visible
through the read-only sysctl variable vfs.ufs.dirhash_mem.
Finally, some extra sanity checks that are enabled by default, but
which may have an impact on performance, can be disabled by setting
vfs.ufs.dirhash_docheck to 0.
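
The general idea of replacing a linear directory scan with an in-core hash array can be sketched as below. This is not the dirhash implementation: the table size, the FNV-1a hash, the linear probing, and every dh_* name are assumptions chosen for brevity, and the names[] array stands in for reading an entry from a directory block at a given offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DH_SLOTS	64		/* hypothetical table size, a power of two */
#define DH_EMPTY	(-1)

/* One directory's in-core table: each slot holds an entry offset. */
struct dirhash_stub {
	int32_t slot[DH_SLOTS];
};

/* FNV-1a string hash; chosen for brevity, not taken from the kernel. */
static uint32_t
dh_hash(const char *name)
{
	uint32_t h = 2166136261u;

	for (; *name != '\0'; name++) {
		h ^= (uint8_t)*name;
		h *= 16777619u;
	}
	return (h);
}

static void
dh_init(struct dirhash_stub *dh)
{
	for (size_t i = 0; i < DH_SLOTS; i++)
		dh->slot[i] = DH_EMPTY;
}

/* Record that the entry called name lives at the given offset. */
static int
dh_add(struct dirhash_stub *dh, const char *name, int32_t offset)
{
	uint32_t i = dh_hash(name) & (DH_SLOTS - 1);

	assert(offset >= 0);
	for (int probes = 0; probes < DH_SLOTS; probes++) {
		if (dh->slot[i] == DH_EMPTY) {
			dh->slot[i] = offset;
			return (0);
		}
		i = (i + 1) & (DH_SLOTS - 1);	/* linear probing */
	}
	return (-1);				/* table full */
}

/* Probe from the name's hash slot, checking each candidate entry. */
static int32_t
dh_lookup(const struct dirhash_stub *dh, const char *const names[],
    const char *name)
{
	uint32_t i = dh_hash(name) & (DH_SLOTS - 1);

	for (int probes = 0; probes < DH_SLOTS; probes++) {
		int32_t off = dh->slot[i];

		if (off == DH_EMPTY)
			return (DH_EMPTY);
		if (strcmp(names[off], name) == 0)
			return (off);
		i = (i + 1) & (DH_SLOTS - 1);
	}
	return (DH_EMPTY);
}
```

A lookup touches only the few slots on the probe path instead of every entry in the directory; the sketch omits deletion and any memory limit.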

Discussed on: -fs, -hackers


79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


78940 28-Jun-2001 jhb

Fix more mntvnode and vnode interlock order reversals.


78912 28-Jun-2001 jhb

- Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.


78256 15-Jun-2001 peter

Fix warning:
1973: warning: int format, long int arg (arg 5)


78191 13-Jun-2001 mckusick

Build on the change in revision 1.98 by Tor.Egge@fast.no.
The symptom being treated in 1.98 was to avoid freeing a
pagedep dependency if there was still a newdirblk dependency
referencing it. That change is correct and no longer prints
a warning message when it occurs. The other part of revision
1.98 was to panic when a newdirblk dependency was encountered
during a file truncation. This fix removes that panic and
replaces it with code to find and delete the newdirblk
dependency so that the truncation can succeed.


77847 07-Jun-2001 tmm

Call vn_close on the backing file vnode if ufs_extattr_enable failed to
avoid leaking it.

Reviewed by: rwatson


77822 06-Jun-2001 jlemon

Add a wrapper for the fifo kqfilter which falls through to the ufs routine.
This permits the fifo to inherit the ufs VNODE kqfilter.


77762 05-Jun-2001 jlemon

Add a kqueue filter for writing to ufs filesystems which always returns
true. This permits better interoperability with programs which register
filters on their stdin/stdout handles.

Submitted by: Niels Provos <provos@citi.umich.edu>


77743 05-Jun-2001 obrien

The order of disk write operations appears to be incorrect because of
a missing check for some dependency. This change avoids the freelist
corruption (but not the temporarily inconsistent state of the file
system).

A message is printed as a reminder of the underlying problem when a
pagedep structure is not freed due to the NEWBLOCK flag being set.

Submitted by: Tor.Egge@fast.no


77509 30-May-2001 jhb

Revert the previous commit in favor of the fix in rev 1.42 of
ufs/ffs/ffs_extern.h instead.

Requested by: bde


77508 30-May-2001 jhb

Forward declare struct cg to quiet a warning.

Submitted by: bde


77445 29-May-2001 jhb

Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a
warning.


77437 29-May-2001 phk

Remove last vestiges of MFS.


77417 29-May-2001 phk

Remove MFS from the kernel.


77190 25-May-2001 tmm

Add a check to determine whether extended attributes have been
initialized on the file system before trying to grab the lock of the
per-mount extattr structure, as this lock is uninitialized in that case.
This is needed because ufs_extattr_vnode_inactive is called from
ufs_inactive, which is also used by EA-unaware file systems such as
ext2fs.

Reviewed by: rwatson


77183 25-May-2001 rwatson

o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
the original macro that pointed p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restructure many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparison.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit


77115 24-May-2001 dillon

This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by: tegge, dillon


77037 23-May-2001 alfred

ufs_bmaparray() may block on I/O; drop the vm mutex and acquire Giant when
calling it from the pager routine.


77031 23-May-2001 ru

- FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file
systems were repo-copied from sys/miscfs to sys/fs.

- Renamed the following file systems and their modules:
fdesc -> fdescfs, portal -> portalfs, union -> unionfs.

- Renamed corresponding kernel options:
FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.

- Install header files for the above file systems.

- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
Makefiles.


76900 20-May-2001 mckusick

Update softdep_setup_directory_add prototype to reflect changes in
actual function.

Obtained from: Jim Bloom <bloom@jbloom.jbloom.org>


76859 19-May-2001 mckusick

Must ensure that all the entries on the pd_pendinghd list have been
committed to disk before clearing them. More specifically, when
free_newdirblk is called, we know that the inode claims the new
directory block. However, if the associated pagedep is still linked
onto the directory buffer dependency chain, then some of the entries
on the pd_pendinghd list may not be committed to disk yet. In this
case, we will simply note that the inode claims the block and let
the pd_pendinghd list be processed when the pagedep is next written.
If the pagedep is no longer on the buffer dependency chain, then
all the entries on the pd_pendinghd list are committed to disk and
we can free them in free_newdirblk. This corrects a window of
vulnerability introduced in the code added in version 1.95.


76827 19-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


76825 18-May-2001 mckusick

Must be a bit less aggressive about freeing pagedep structures.

Obtained from: Robert Watson <rwatson@FreeBSD.org> and
Matthew Jacob <mjacob@feral.com>


76724 17-May-2001 mckusick

When a new block is allocated to a directory, an fsync of a file
whose name is within that block must ensure not only that the block
containing the file name has been written, but also that the on-disk
directory inode references that block. When a new directory block
is created, we allocate a newdirblk structure which is linked to
the associated allocdirect (on its ad_newdirblk list). When the
allocdirect has been satisfied, the newdirblk structure is moved
to the inodedep id_bufwait list of its directory to await the inode
being written. When the inode is written, the directory entries
are fully committed and can be deleted from their pagedep->pd_pendinghd
and inodedep->id_pendinghd lists.


76688 16-May-2001 iedowse

Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.
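
The busy check with an expected root reference count can be modelled in userland as below. The rootvnode_stub type, the flush_root() name, and the way a forced unmount is modelled (other users simply lose their references) are simplifying assumptions; the real vflush() walks every vnode on the mount.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the root vnode's state at unmount time. */
struct rootvnode_stub {
	int usecount;	/* current references on the root vnode */
};

/*
 * Decide whether an unmount may proceed. rootrefs is the number of
 * references the filesystem itself holds on its root vnode; anything
 * beyond that means another user. On success the rootrefs references
 * are released here rather than in each filesystem's unmount routine.
 */
static int
flush_root(struct rootvnode_stub *root, int rootrefs, bool forced)
{
	assert(root->usecount >= rootrefs);
	if (root->usecount > rootrefs) {
		if (!forced)
			return (EBUSY);
		root->usecount = rootrefs;	/* model: other users are revoked */
	}
	root->usecount -= rootrefs;
	return (0);
}
```

Keeping both the expectation and the release in one place means a forced unmount follows the same path as a normal one, differing only in how extra references are treated.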

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by: phk, bp


76580 14-May-2001 mckusick

Further fixes for deadlock in the presence of multiple snapshots.
There are still more to find, but this fix should cover the
common cases that folks are hitting.


76557 13-May-2001 mckusick

If the effective link count is zero when an NFS file handle request
comes in for it, the file is really gone, so return ESTALE.

The problem arises when the last reference to an FFS file is
released because soft-updates may delay the actual freeing of the
inode for some time. Since there are no filesystem links or open
file descriptors referencing the inode, from the point of view of
the system, the file is inaccessible. However, if the filesystem
is NFS exported, then the remote client can still access the inode
via ufs_fhtovp() until the inode really goes away. To prevent this
anomaly, it is necessary to begin returning ESTALE at the same time
that the file ceases to be accessible to the local filesystem.

Obtained from: Ian Dowse <iedowse@maths.tcd.ie>


76458 11-May-2001 mckusick

Remove yet another deadlock case.


76357 08-May-2001 mckusick

When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.
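
The distinction between what statfs reports and what an allocation can use may be sketched as follows. The fs_counts structure and all field and function names are hypothetical, chosen only to mirror the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters; the field names are illustrative only. */
struct fs_counts {
	int64_t free_blocks;	/* blocks free on disk right now */
	int64_t pending_blocks;	/* blocks committed to being freed */
	int64_t free_inodes;	/* inodes free on disk right now */
	int64_t pending_inodes;	/* inodes committed to being freed */
};

/* Free blocks as reported to statfs callers: current plus to-be-freed. */
static int64_t
reported_bfree(const struct fs_counts *c)
{
	assert(c->pending_blocks >= 0);
	return (c->free_blocks + c->pending_blocks);
}

/* Free inodes as reported to statfs callers, on the same basis. */
static int64_t
reported_ffree(const struct fs_counts *c)
{
	assert(c->pending_inodes >= 0);
	return (c->free_inodes + c->pending_inodes);
}

/* Blocks an allocation can use today; pending space is not yet available. */
static int64_t
allocatable_blocks(const struct fs_counts *c)
{
	return (c->free_blocks);
}
```

The gap between reported_bfree() and allocatable_blocks() is the window in which df shows free space that a write may still fail to obtain.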


76356 08-May-2001 mckusick

Several fixes for units errors:
1) Do not assume that the superblock will be of size fs->fs_bsize.
This fixes a panic when taking a snapshot on a filesystem with
a block size bigger than 8K.
2) Properly calculate the number of fragments that follow the
superblock summary information. This fixes a bug with inconsistent
snapshots.
3) When cleaning up a snapshot that is about to be removed, properly
calculate the number of blocks that need to be checked. This fixes
a bug that created partially allocated inodes.
4) When moving blocks from a snapshot that is about to be removed
to another snapshot, properly account for the reduced number of
blocks in the snapshot from which they are taken. This fixes a
bug in which the number of blocks released from a snapshot did not
match the number that it claimed to have.


76354 08-May-2001 mckusick

When syncing out snapshot metadata, we must temporarily allow recursive
buffer locking so as to avoid locking against ourselves if we need to
write filesystem metadata.


76269 04-May-2001 mckusick

Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce
the amount of time that the filesystem must be suspended. The
current snapshot is elided as well as the earlier snapshots.


76174 01-May-2001 phk

Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.


76173 01-May-2001 phk

Remove blatantly pointless call to VOP_BMAP().

Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.


76167 01-May-2001 phk

Implement vop_std{get|put}pages() and add them to the default vop[].

Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but
the default.


76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


76132 29-Apr-2001 phk

VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".


76131 29-Apr-2001 phk

Add a vop_stdbmap(), and make it part of the default vop vector.

Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.


76129 29-Apr-2001 phk

Call ufs_bmaparray() directly instead of indirectly via VOP_BMAP().


76128 29-Apr-2001 phk

Remove two unused arguments from ufs_bmaparray().


76127 29-Apr-2001 phk

Remove faint traces of blind copy&paste.


76126 29-Apr-2001 phk

Remove faint traces of non-existent ffs_bmap().


76117 29-Apr-2001 grog

Revert consequences of changes to mount.h, part 2.

Requested by: bde


75993 26-Apr-2001 mckusick

Rather than copying all the indirect blocks of the snapshot,
simply mark them as BLK_NOCOPY. This trick cuts the initial
size of the snapshot in half and cuts the time to take a
snapshot by a third.


75943 25-Apr-2001 mckusick

When closing the last reference to an unlinked file, it is freed
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.

For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.

Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.


75934 25-Apr-2001 phk

Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" parameter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently, all non-stacking filesystems can use
vfs_stdcheckexp().

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>


75892 24-Apr-2001 iedowse

Pre-dirpref versions of fsck may zero out the new superblock fields
fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause
panics if these fields were zeroed while a filesystem was mounted
read-only, and then remounted read-write.

Add code to ffs_reload() which copies the fs_contigdirs pointer
from the previous superblock, and reinitialises fs_avgf* if necessary.

Reviewed by: mckusick


75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


75580 17-Apr-2001 phk

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
in all relevant places, for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


75573 17-Apr-2001 mckusick

Add debugging option to always read/write cylinder groups as full
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicking
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:

ffs_clusteralloc: map mismatch
ffs_nodealloccg: map corrupted
ffs_nodealloccg: block not in map
ffs_alloccg: map corrupted
ffs_alloccg: block not in map
ffs_alloccgblk: cyl groups corrupted
ffs_alloccgblk: can't find blk in cyl
ffs_checkblk: partially free fragment

The following panics are less likely to be related to this problem,
but might be helped by these debugging options:

ffs_valloc: dup alloc
ffs_blkfree: freeing free block
ffs_blkfree: freeing free frag
ffs_vfree: freeing free inode

If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.


75572 17-Apr-2001 mckusick

Background fsck sysctl operations must use vn_start_write and
vn_finished_write so that they do not attempt to modify a
suspended filesystem.


75571 17-Apr-2001 rwatson

In my first reading of POSIX.1e, I misinterpreted handling of the
ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the
access ACL could be used by privileged processes to change file/directory
ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and
ACL_OTHER) should have undefined ae_id fields; this commit attempts
to correct that misunderstanding.

o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid
associated with the vnode, as those can no longer be extracted from
the ACL passed as an argument. Perform all comparisons against
the passed arguments. This actually has the effect of simplifying
a number of components of this call, as well as reducing the indent
level, but now separates handling of ACL_GROUP_OBJ from ACL_GROUP.

o Modify acl_posix1e_check() to return EINVAL if the ae_id field of
any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value
other than ACL_UNDEFINED_ID. As a temporary work-around to allow
clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before
each check so that this cannot cause a failure in the short term
(this work-around will be removed when the userland libraries and
utilities are updated to take this change into account).

o Modify ufs_sync_acl_from_inode() so that it forces
ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID
when synchronizing the ACL from the inode.

o Modify ufs_sync_inode_from_acl to not propagate uid and gid
information to the inode from the ACL during ACL update. Also
modify the masking of permission bits that may be set from
ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not
carry non-ACCESSPERMS bits (S_ISUID, S_ISGID, S_ISTXT).

o Modify ufs_getacl() so that when it emulates an access ACL from
the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.

o Clean up ufs_setacl() substantially since it is no longer possible
to perform chown/chgrp operations using vop_setacl(), so all the
access control for that can be eliminated.

o Modify ufs_access() so that it passes owner uid and gid information
into vaccess_acl_posix1e().

Pointed out by: jedgar
Obtained from: TrustedBSD Project


75515 14-Apr-2001 mckusick

Update to describe use of mdconfig instead of deprecated vnconfig.

Submitted by: Steve Ames <steve@virtual-voodoo.com>


75503 14-Apr-2001 mckusick

This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag.
It is described in ufs/ffs/fs.h as follows:

/*
* Filesystem flags.
*
* Note that the FS_NEEDSFSCK flag is set and cleared only by the
* fsck utility. It is set when background fsck finds an unexpected
* inconsistency which requires a traditional foreground fsck to be
* run. Such inconsistencies should only be found after an uncorrectable
* disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
* it has successfully cleaned up the filesystem. The kernel uses this
* flag to enforce that inconsistent filesystems be mounted read-only.
*/
#define FS_UNCLEAN 0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */
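
How a mount path might enforce the read-only rule can be sketched as below. The flag values are the ones quoted above; the check_mount_flags() helper, its int parameter, and the choice of EPERM as the error are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define FS_UNCLEAN	0x01	/* filesystem not clean at mount */
#define FS_DOSOFTDEP	0x02	/* filesystem using soft dependencies */
#define FS_NEEDSFSCK	0x04	/* filesystem needs sync fsck before mount */

/*
 * Return 0 if a mount in the requested mode may proceed, or EPERM if a
 * read-write mount is refused because a foreground fsck is still needed.
 * The choice of EPERM is an assumption of this sketch.
 */
static int
check_mount_flags(int fs_flags, bool read_only)
{
	assert(fs_flags >= 0 && fs_flags <= 0xff);	/* sketch: flags fit in one byte */
	if ((fs_flags & FS_NEEDSFSCK) != 0 && !read_only)
		return (EPERM);
	return (0);
}
```

A read-only mount is still allowed, so the administrator can inspect the filesystem before running the foreground fsck that clears the flag.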


75377 10-Apr-2001 mckusick

Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>.
His description of the problem and solution follow. My own tests show
speedups on typical filesystem intensive workloads of 5% to 12% which
is very impressive considering the small amount of code change involved.

------

One day I noticed that some file operations run much faster on
small file systems than on big ones. I've looked at the ffs
algorithms, thought about them, and redesigned the dirpref algorithm.

First I want to describe the results of my tests. These results are old
and I have improved the algorithm after these tests were done. Nevertheless
they show how big the performance speedup may be. I have done two file/directory
intensive tests on two OpenBSD systems with the old and new dirpref algorithms.
The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports".
The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release.
It contains 6596 directories and 13868 files. The test systems are:

1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for
test is at wd1. Size of test file system is 8 Gb, number of cg=991,
size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current
from Dec 2000 with BUFCACHEPERCENT=35

2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system
at wd0, file system for test is at wd1. Size of test file system is 40 Gb,
number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k
OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50

You can get more info about the test systems and methods at:
http://www.ptci.ru/gluk/dirpref/old/dirpref.html

Test Results

                tar -xzf ports.tar.gz                rm -rf ports
mode      old dirpref  new dirpref  speedup  old dirpref  new dirpref  speedup
First system
normal        667         472        1.41       477         331        1.44
async         285         144        1.98       130          14        9.29
sync          768         616        1.25       477         334        1.43
softdep       413         252        1.64       241          38        6.34
Second system
normal        329          81        4.06       263.5        93.5      2.81
async         302          25.7     11.75       112           2.26    49.56
sync          281          57.0      4.93       263          90.5      2.9
softdep       341          40.6      8.4        284           4.76    59.66

"old dirpref" and "new dirpref" columns give a test time in seconds.
The speedup column is the ratio of the two times, i.e. old dirpref / new dirpref.

------

Algorithm description

The old dirpref algorithm is described in comments:

/*
* Find a cylinder to place a directory.
*
* The policy implemented by this algorithm is to select from
* among those cylinder groups with above the average number of
* free inodes, the one with the smallest number of directories.
*/

A new directory is allocated in a different cylinder group than its
parent directory, resulting in a directory tree that is spread across
all the cylinder groups. This spreading out results in non-optimal
access to the directories and files. When we have a small filesystem
it is not a problem, but when the filesystem is big the performance
degradation becomes very apparent.

What do I mean by a big file system?

1. A big filesystem is a filesystem which occupies 20-30 or more percent
of total drive space, i.e. first and last cylinder are physically
located relatively far from each other.
2. It has a relatively large number of cylinder groups, for example
more cylinder groups than 50% of the buffers in the buffer cache.

The first results in long access times, while the second results in
many buffers being used by metadata operations. Such operations use
cylinder group blocks and on-disk inode blocks. The cylinder group
block (fs->fs_cblkno) contains struct cg, inode and block bit maps.
It is 2k in size for the default filesystem parameters. If new and
parent directories are located in different cylinder groups then the
system performs more input/output operations and uses more buffers.
On filesystems with many cylinder groups, lots of cache buffers are
used for metadata operations.

My solution for this problem is very simple. I allocate many directories
in one cylinder group. I also take some precautions so that the new
allocation method does not cause excessive fragmentation and so that
directory inodes are not located far from their files' inodes and data.
The algorithm is:
/*
* Find a cylinder group to place a directory.
*
* The policy implemented by this algorithm is to allocate a
* directory inode in the same cylinder group as its parent
* directory, but also to reserve space for its files inodes
* and data. Restrict the number of directories which may be
* allocated one after another in the same cylinder group
* without intervening allocation of files.
*
* If we allocate a first level directory then force allocation
* in another cylinder group.
*/

My early versions of dirpref gave me good results for a wide range of
file operations and different filesystem capacities except one case:
those applications that create their entire directory structure first
and only later fill this structure with files.

My solution for such and similar cases is to limit the number of
directories which may be created one after another in the same cylinder
group without intervening file creations. For this purpose, I allocate
an array of counters at mount time. This array is linked to the superblock
fs->fs_contigdirs[cg]. Each time a directory is created the counter
increases and each time a file is created the counter decreases. A 60Gb
filesystem with 8mb/cg requires 10kb of memory for the counters array.
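
The counter scheme described above can be sketched as follows. The function
names are hypothetical and one-byte counters are an assumption; with 8mb
cylinder groups a 60Gb filesystem has 7680 groups, so one byte per group
is of the same order as the 10kb quoted above.

```c
#include <stdint.h>

#define CONTIGDIRS_MAX	255	/* saturation limit of a one-byte counter */

/* Hypothetical hook: a directory inode was allocated in group cg. */
static void
contigdirs_dir_created(uint8_t *contigdirs, int cg)
{
	if (contigdirs[cg] < CONTIGDIRS_MAX)
		contigdirs[cg]++;
}

/* Hypothetical hook: a file inode was allocated in group cg. */
static void
contigdirs_file_created(uint8_t *contigdirs, int cg)
{
	if (contigdirs[cg] > 0)
		contigdirs[cg]--;
}

/*
 * A group may take another directory only while its counter is below
 * the limit; otherwise the allocator should look at other groups.
 */
static int
contigdirs_allows_dir(const uint8_t *contigdirs, int cg, int maxcontigdirs)
{
	return (contigdirs[cg] < maxcontigdirs);
}
```

Creating files in a group lowers its counter again, so a workload that mixes
file and directory creation is never pushed out of the parent's group.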

The maxcontigdirs is a maximum number of directories which may be created
without an intervening file creation. I found in my tests that the best
performance occurs when I restrict the number of directories in one cylinder
group such that all its files may be located in the same cylinder group.
There may be some deterioration in performance if all the file inodes
are in the same cylinder group as their containing directory, but their
data partially resides in a different cylinder group. The maxcontigdirs
value is calculated to try to prevent this condition. Since there is
no way to know how many files and directories will be allocated later
I added two optimization parameters in superblock/tunefs. They are:

int32_t fs_avgfilesize; /* expected average file size */
int32_t fs_avgfpdir; /* expected # of files per directory */

These parameters have reasonable defaults but may be tweaked for special
uses of a filesystem. They are only necessary in rare cases like better
tuning a filesystem being used to store a squid cache.
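
One way the two parameters could feed into maxcontigdirs is sketched below.
The formula is an illustration of the stated goal (keep a directory's files
and data in its own cylinder group), not a quotation of the committed code;
the function name, its arguments, and the cap of 255 are assumptions.

```c
#include <stdint.h>

/*
 * Hypothetical estimate of maxcontigdirs for one cylinder group.
 *   cg_free_bytes  free data space expected in a cylinder group
 *   inodes_per_cg  number of inodes in a cylinder group
 *   avgfilesize    expected average file size (fs_avgfilesize)
 *   avgfpdir       expected number of files per directory (fs_avgfpdir)
 */
static int64_t
estimate_maxcontigdirs(int64_t cg_free_bytes, int64_t inodes_per_cg,
    int32_t avgfilesize, int32_t avgfpdir)
{
	int64_t dirsize, limit;

	if (avgfilesize <= 0 || avgfpdir <= 0)
		return (1);

	/* Space one directory's files are expected to need. */
	dirsize = (int64_t)avgfilesize * avgfpdir;

	/* Directories whose data fits in the group ... */
	limit = cg_free_bytes / dirsize;
	/* ... and whose file inodes fit in the group. */
	if (limit > inodes_per_cg / avgfpdir)
		limit = inodes_per_cg / avgfpdir;

	if (limit > 255)
		limit = 255;
	if (limit < 1)
		limit = 1;
	return (limit);
}
```

For example, with 8mb of free space per group, 2048 inodes per group, an
average file size of 16384 bytes, and 64 files per directory, the sketch
yields a limit of 8 directories.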

I have been using this algorithm for about 3 months. I have done
a lot of testing on filesystems with different capacities, average
filesize, average number of files per directory, and so on. I think
this algorithm has no negative impact on filesystem performance. It
works better than the default one in all cases. The new dirpref
will greatly improve untarring/removing/copying of big directories,
decrease load on cvs servers and much more. The new dirpref doesn't
speed up a compilation process, but also doesn't slow it down.

Obtained from: Grigoriy Orlov <gluk@ptci.ru>


75138 03-Apr-2001 rwatson

o Indent sub-section headings to be consistent with README.extattr.

Obtained from: TrustedBSD Project


75134 03-Apr-2001 rwatson

o Introduce a README file describing briefly how to use access control
lists, in the style of FFS README files for soft updates and snapshots.

Obtained from: TrustedBSD Project


75133 03-Apr-2001 rwatson

o Introduce a README file describing briefly how to use extended
attributes, in the style of FFS README files for soft updates and
snapshots.

Obtained from: TrustedBSD Project


75106 03-Apr-2001 rwatson

o Change the default from using IO_SYNC on EA set and delete operations
to not using IO_SYNC. Expose a sysctl (debug.ufs_extattr_sync) for
enabling the use of IO_SYNC.

- Use of IO_SYNC substantially degrades ACL performance when a
default ACL is set on a directory, as there are four synchronous
writes initiated to define both supporting EAs for new
sub-directories, and to set the data; two for new files. Later, this
may be optimized to two writes for sub-directories, one for new
files.

- IO_SYNC does not substantially improve consistency properties due
to the poor consistency properties of existing permissions (which
ACLs are a superset of), due to interaction with soft updates,
and due to differences in handling consistency for data and file
system meta-data.

- In macro-benchmarks, this reduces the overhead of setting default
ACLs down to the same overhead as enabling ACLs on a file system
and not using them. Enabling ACLs still introduces a small
overhead (I measure 7% on a -j 2 buildworld with pre-allocated
EA backing store, but this is not rigorous testing, nor in any way
optimized).

- The sysctl will probably change to another administration method
(or at least, a better name) in the near future, but consistency
properties of EAs are still being worked out. The toggle is defined
right now to allow easier performance analysis and exploration
of possible guarantees.
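
The effect of the toggle can be sketched as below. The flag values are
placeholders rather than the kernel's definitions, the plain variable
stands in for the debug.ufs_extattr_sync sysctl, and the helper name
extattr_write_ioflag() is hypothetical.

```c
/* Placeholder flag values for this sketch only. */
#define IO_SYNC		0x04	/* do the write synchronously */
#define IO_NODELOCKED	0x08	/* caller already holds the vnode lock */

/* Stand-in for the debug.ufs_extattr_sync sysctl; off by default. */
static int ufs_extattr_sync = 0;

/* Hypothetical helper: flags used for an EA set or delete write. */
static int
extattr_write_ioflag(void)
{
	int ioflag = IO_NODELOCKED;

	if (ufs_extattr_sync)
		ioflag |= IO_SYNC;
	return (ioflag);
}
```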

Obtained from: TrustedBSD Project


75077 02-Apr-2001 rwatson

o Correct an ACL implementation bug that could result in a system panic
under heavy use when default ACLs were being inherited by new files
or directories. This is done by removing a bug in default ACL
reading, and improving error handling for this failure case:

- Move the setting of the buffer length (len) variable to above the
ACL type (ap->a_type) switch rather than having it only for
ACL_TYPE_ACCESS. Otherwise, the len variable is uninitialized in
the ACL_TYPE_DEFAULT case, which generally worked right, but could
result in failure.

- Add a check for a short/long read of the ACL_TYPE_DEFAULT type from
the underlying EA, resulting in EPERM rather than passing a
potentially corrupted ACL back to the caller (resulting in "cleaner"
failures if the EA is damaged: right now, the caller will almost
always panic in the presence of a corrupted EA). This code is similar
to code in the ACL_TYPE_ACCESS handling in the previous switch case.

- While I'm fixing this code, remove a redundant bzero() of the ACL
reader buffer; it need only be initialized above the acl_type
switch.
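
The shape of the fix can be sketched in userland as follows. All sk_ names
are hypothetical, the ACL structure is a stand-in, and both ACL types share
one code path here for brevity, which the real ufs_getacl() does not do.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SK_ACL_TYPE_ACCESS	1
#define SK_ACL_TYPE_DEFAULT	2

/* Stand-in for the on-disk ACL image stored in the EA. */
struct sk_acl {
	int	cnt;
	int	entries[32];
};

/* Reader callback: fills buf and stores the bytes actually read in *len. */
typedef int (*sk_ea_read_t)(void *buf, size_t *len);

static int
sk_getacl(int type, struct sk_acl *acl, sk_ea_read_t ea_read)
{
	size_t len;
	int error;

	/* Set up once, above the switch, so every case sees valid values. */
	len = sizeof(*acl);
	memset(acl, 0, sizeof(*acl));

	switch (type) {
	case SK_ACL_TYPE_ACCESS:
	case SK_ACL_TYPE_DEFAULT:
		error = ea_read(acl, &len);
		if (error != 0)
			return (error);
		/* A short or long read means the stored ACL is damaged. */
		if (len != sizeof(*acl))
			return (EPERM);
		return (0);
	default:
		return (EINVAL);
	}
}

/* Stand-in readers for trying the sketch out. */
static int
sk_read_whole(void *buf, size_t *len)
{
	memset(buf, 0, *len);
	return (0);
}

static int
sk_read_short(void *buf, size_t *len)
{
	*len /= 2;
	memset(buf, 0, *len);
	return (0);
}
```

Passing sk_read_short makes sk_getacl() fail with EPERM instead of handing
a truncated ACL back to the caller.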

Obtained from: TrustedBSD Project


74822 26-Mar-2001 rwatson

Introduce support for POSIX.1e ACLs on UFS-based file systems. This
implementation is still experimental, and while fairly broadly tested,
is not yet intended for production use. Support for POSIX.1e ACLs on
UFS will not be MFC'd to RELENG_4.

This implementation works by providing implementations of VOP_[GS]ETACL()
for FFS, as well as modifying the appropriate access control and file
creation routines. In this implementation, ACLs are backed into extended
attributes; the base ACL (owner, group, other) permissions remain in the
inode for performance and compatibility reasons, so only the extended and
default ACLs are placed in extended attributes. The logic for ACL
evaluation is provided by the fs-independent kern/kern_acl.c.

o Introduce UFS_ACL, a compile-time configuration option that enables
support for ACLs on FFS (and potentially other UFS-based file systems).
o Introduce ufs_getacl(), ufs_setacl(), ufs_aclcheck(), which
respectively get, set, and check the ACLs on the passed vnode.
o Introduce ufs_sync_acl_from_inode(), ufs_sync_inode_from_acl() to
maintain access control information between inode permissions and
extended attribute data.
o Modify ufs_access() to load a file access ACL and invoke
vaccess_acl_posix1e() if ACLs are available on the file system
o Modify ufs_mkdir() and ufs_makeinode() to associate ACLs with newly
created directories and files, inheriting from the parent directory's
default ACL.
o Enable these new vnode operations and conditionally compiled code
paths if UFS_ACL is defined.

A few notes:

o This implementation is fairly widely tested, but still should be
considered experimental.
o Currently, ACLs are not exported via NFS; instead, the summarizing
file mode/etc from the inode is. This results in conservative
protection behavior, similar to the behavior of ACL-nonaware programs
acting locally.
o It is possible that underlying binary data formats associated with
this implementation may change. Consumers of the implementation
should expect to find their local configuration obsoleted in the
next few months, resulting in possible loss of ACL data during an
upgrade.
o The extended attributes interface and implementation is still
undergoing modification to address portable interface concerns, as
well as performance.
o Many applications do not yet correctly handle ACLs. In general,
due to the POSIX.1e ACL model, behavior of ACL-unaware applications
will be conservative with respect to file protection; some caution
is recommended.
o Instructions for configuring and maintaining ACLs on UFS will be
committed in the near future; in the mean time it is possible to
reference the README included in the last UFS ACL distribution
placed in the TrustedBSD web site:

http://www.TrustedBSD.org/downloads/

Substantial debugging, hardware, travel, or connectivity support for this
project was provided by: BSDi, Safeport Network Services, and NAI Labs.
Significant coding contributions were made by Chris Faulhaber. Additional
support was provided by Brian Feldman, Thomas Moestl, and Ilmar Habibulin.

Reviewed by: jedgar, keichii, mckusick, trustedbsd-discuss, freebsd-fs
Obtained from: TrustedBSD Project


74810 26-Mar-2001 phk

Send the remains (such as I have located) of "block major numbers" to
the bit-bucket.


74747 24-Mar-2001 asmodai

Fix typo ); -> ,


74705 23-Mar-2001 mckusick

Check that background fsck operation is being done on a ufs filesystem.

Obtained from: Robert Watson <rwatson@FreeBSD.org>


74608 21-Mar-2001 rwatson

o Remove an unnecessary debugging printf from ufs_extattr_lookup(),
which resulted in the output of warning messages at boot if
UFS_EXTATTR_AUTOSTART was enabled but ".attribute" and possible
sub-directories weren't in a mounted MFS or UFS file system.

Pointed out by: dcs
Obtained from: TrustedBSD Project


74548 21-Mar-2001 mckusick

Add kernel support for running fsck on active filesystems.


74547 21-Mar-2001 mckusick

Clear the fs_clean flag only when the FS_UNCLEAN flag is not set
(as is done in unmount).

Remove a snapshot inode from the superblock list when its last
name goes away rather than when its last reference goes away.
That way it will be properly reclaimed by fsck after a crash
rather than reenabled when the filesystem is mounted.


74545 21-Mar-2001 mckusick

Report the correct inode number when panicking with freeing free inode.
Report the correct block number when panicking with freeing free block.


74442 19-Mar-2001 rwatson

o Enable UFS-based extended attribute support on MFS. Note that this change
is under-tested, and that MFS appears to be in the process of being
deprecated in favor of FFS over md. Note also that UFS_EXTATTR_AUTOSTART
doesn't make much sense on MFS unless the MFSROOT is compiled in, so
manual configuration is generally required.

Obtained from: TrustedBSD Project


74437 19-Mar-2001 rwatson

o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by: jkh
Obtained from: TrustedBSD Project


74433 19-Mar-2001 rwatson

o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to
options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively. This change
reflects the fact that our EA support is implemented entirely at the
UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount
events). This also better reflects the fact that [shortly] MFS will also
support EAs, as well as possibly IFS.

o Consumers of the EA support in FFS are reminded that as a result, they
must change kernel config files to reflect the new option names.

Obtained from: TrustedBSD Project


74404 18-Mar-2001 rwatson

o Caused FFS_EXTATTR_AUTOSTART to scan two sub-directories of ".attribute"
off of the file system root: "user" for user attributes, and "system"
for system attributes. When the scan occurs, attribute backing files
discovered in those directories will be started in the respective
namespaces. This re-introduces support for auto-starting of user
attributes, which was removed when the "$" prefix for system attributes
was replaced with explicit namespacing.

For users of the TrustedBSD UFS POSIX.1e ACL code, you'll need to:
mv ${FSROOT}/'$posix1e.acl_access' ${FSROOT}/system/posix1e.acl_access
mv ${FSROOT}/'$posix1e.acl_default' ${FSROOT}/system/posix1e.acl_default

For users of the TrustedBSD POSIX.1e Capability code, you'll need to:
mv ${FSROOT}/'$posix1e.cap' ${FSROOT}/system/posix1e.cap

For users of the TrustedBSD MAC code, you'll need to:
mv ${FSROOT}/'$freebsd.mac' ${FSROOT}/system/freebsd.mac

Updated versions of relevant patches will be released in the near
future.

Obtained from: TrustedBSD Project


74273 15-Mar-2001 rwatson

o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.

o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to avoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.

o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.
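
The move from a first-character indicator to an explicit namespace argument
can be pictured with this sketch, which maps an old-style name onto a
(namespace, name) pair. The helper split_legacy_attrname() is hypothetical
and the constant values are placeholders chosen for the example.

```c
/* Placeholder values; only the names come from the text above. */
#define EXTATTR_NAMESPACE_USER		1
#define EXTATTR_NAMESPACE_SYSTEM	2

/*
 * Hypothetical helper: an old-style name starting with '$' denoted a
 * system attribute; everything else was a user attribute.  Returns the
 * attribute name without the prefix and stores the namespace.
 */
static const char *
split_legacy_attrname(const char *legacy, int *attrnamespace)
{
	if (legacy[0] == '$') {
		*attrnamespace = EXTATTR_NAMESPACE_SYSTEM;
		return (legacy + 1);
	}
	*attrnamespace = EXTATTR_NAMESPACE_USER;
	return (legacy);
}
```

Under this mapping "$posix1e.acl_access" becomes the name
"posix1e.acl_access" in EXTATTR_NAMESPACE_SYSTEM.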

Obtained from: TrustedBSD Project


74256 14-Mar-2001 rwatson

o In my merge, missed the one-line patch to ufs_vnops.c that removed
the static prototype for ufs_readdir(). Note that ufs_readdir() was
actually already non-static, the prototype was incorrect.

Submitted by: jedgar


74234 14-Mar-2001 rwatson

o Implement "options FFS_EXTATTR_AUTOSTART", which depends on
"options FFS_EXTATTR". When extended attribute auto-starting
is enabled, FFS will scan the .attribute directory off of the
root of each file system, as it is mounted. If .attribute
exists, EA support will be started for the file system. If
there are files in the directory, FFS will attempt to start
them as attribute backing files for attributes bearing the same
name. All attributes are started before access to the file
system is permitted, so this permits race-free enabling of
attributes. For attributes backing support for security
features, such as ACLs, MAC, Capabilities, this is vital, as
it prevents the file system attributes from getting out of
sync as a result of file system operations between mount-time
and the enabling of the extended attribute. The userland
extattrctl tool will still function exactly as previously.
Files must be placed directly in .attribute, which must be
directly off of the file system root: symbolic links are
not permitted. FFS_EXTATTR will continue to be able
to function without FFS_EXTATTR_AUTOSTART for sites that do not
want/require auto-starting. If you're using the UFS_ACL code
available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART
is recommended.

o This support is implemented by adding an invocation of
ufs_extattr_autostart() to ffs_mountfs(). In addition,
several new supporting calls are introduced in
ufs_extattr.c:

ufs_extattr_autostart(): start EAs on the specified mount
ufs_extattr_lookup(): given a directory and filename,
return the vnode for the file.
ufs_extattr_enable_with_open(): invoke ufs_extattr_enable()
after doing the equivalent of vn_open()
on the passed file.
ufs_extattr_iterate_directory(): iterate over a directory,
invoking ufs_extattr_lookup() and
ufs_extattr_enable_with_open() on each
entry.

o This feature is not widely tested, and therefore may contain
bugs; caution is advised. Several changes are in the pipeline
for this feature, including breaking out of EA namespaces into
subdirectories of .attribute (this is waiting on the updated
EA API), as well as a per-filesystem flag indicating whether
or not EAs should be auto-started. This is required because
administrators may not want .attribute auto-started on all
file systems, especially if non-administrators have write access
to the root of a file system.

Obtained from: TrustedBSD Project


73942 07-Mar-2001 mckusick

Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).


73929 07-Mar-2001 jhb

Grab the process lock while calling psignal and before calling psignal.


73928 07-Mar-2001 jhb

Protect SIGDELSET of p_siglist with the proc lock.


73287 01-Mar-2001 mckusick

Free lock before returning from process_worklist_item.

Obtained from: Constantine Sapuntzakis <csapuntz@stanford.edu>


73286 01-Mar-2001 adrian

Reviewed by: jlemon

An initial tidyup of the mount() syscall and VFS mount code.

This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.

* the guts of the mount work has been moved into vfs_mount().

* move `type', `path' and `flags' from being userland variables into being
kernel variables in vfs_mount(). `data' remains a pointer into
userspace.

* Attempt to verify the `type' and `path' strings passed to vfs_mount()
aren't too long.

* rework mount() and linux_mount() to take the userland parameters
(besides data, as mentioned) and pass kernel variables to vfs_mount().
(linux_mount() already did this, I've just tidied it up a little more.)

* remove the copyin*() stuff for `path'. `data' still requires copyin*()
since it's a pointer into userland.

* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
filesystem. This variable is generally initialised with `path', and
each filesystem can override it if they want to.

* NOTE: f_mntonname is initialised with "/" in the case of a root mount.


72956 23-Feb-2001 jlemon

Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().
Use this to tell a filter attached to a vnode that the underlying vnode is
no longer valid, by returning EV_EOF.

PR: kern/25309, kern/25206


72953 23-Feb-2001 jlemon

Use correct list pointer when detaching knote from list.


72941 23-Feb-2001 mckusick

Free lock before calling panic so that subsequent attempt to write out
buffers does not re-panic with `locking against myself'. This change
should not affect normal operations of soft updates in any way.


72872 22-Feb-2001 mckusick

When cleaning up excess inode dependencies, check for being done.

Reviewed by: Jan Koum <jkb@yahoo-inc.com>


72765 20-Feb-2001 mckusick

This patch corrects two problems with the rate limiting code
that was introduced in revision 1.80. The problem manifested
itself with a `locking against myself' panic and could also
result in soft updates inconsistencies associated with inodedeps.
The two problems are:

1) One of the background operations could manipulate the bitmap
while holding it locked with intent to create. This held lock
results in a `locking against myself' panic, when the background
processing that we have been coopted to do tries to lock the bitmap
which we are already holding locked. To understand how to fix this
problem, first, observe that we can do the background cleanups in
inodedep_lookup only when allocating inodedeps (DEPALLOC is set in
the call to inodedep_lookup). Second observe that calls to
inodedep_lookup with DEPALLOC set can only happen from the following
calls into the softdep code:

softdep_setup_inomapdep
softdep_setup_allocdirect
softdep_setup_remove
softdep_setup_freeblocks
softdep_setup_directory_change
softdep_setup_directory_add
softdep_change_linkcnt

Only the first two of these can come from ffs_alloc.c while holding
a bitmap locked. Thus, inodedep_lookup must not go off to do
request_cleanups when being called from these functions. This change
adds a flag, NODELAY, that can be passed to inodedep_lookup to let
it know that it should not do background processing in those cases.

2) The return value from request_cleanup when helping out with the
cleanup was 0 instead of 1. This meant that despite the fact that
we may have slept while doing the cleanups, the code did not recheck
for the appearance of an inodedep (e.g., goto top in inodedep_lookup).
This led to the softdep inconsistency in which we ended up with
two inodedep's for the same inode.
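
Both fixes can be seen together in the following single-threaded sketch.
Everything suffixed _sketch is a stand-in, the table of inode numbers
replaces the real inodedep hash, and the race_inum variable simulates
another thread creating an inodedep while we sleep in the cleanup.

```c
#include <stddef.h>

#define DEPALLOC	0x01	/* allocate structure if lookup fails */
#define NODELAY		0x02	/* caller holds a bitmap lock: no cleanups */
#define MAXDEPS		8

static int deps[MAXDEPS];	/* inode numbers that have an inodedep */
static int ndeps;
static int race_inum;		/* added "by another thread" during cleanup */

static int *
find_dep(int inum)
{
	int i;

	for (i = 0; i < ndeps; i++)
		if (deps[i] == inum)
			return (&deps[i]);
	return (NULL);
}

/* Stand-in for request_cleanup(): returns 1 because it may have slept. */
static int
request_cleanup_sketch(void)
{
	if (race_inum != 0 && ndeps < MAXDEPS) {
		deps[ndeps++] = race_inum;
		race_inum = 0;
	}
	return (1);
}

static int *
inodedep_lookup_sketch(int inum, int flags)
{
	int firsttry = 1;
	int *dep;

top:
	dep = find_dep(inum);
	if (dep != NULL || (flags & DEPALLOC) == 0)
		return (dep);
	/* Fix 1: skip background work when NODELAY is passed. */
	if (firsttry && (flags & NODELAY) == 0 && request_cleanup_sketch()) {
		/* Fix 2: we may have slept, so look the inode up again. */
		firsttry = 0;
		goto top;
	}
	if (ndeps == MAXDEPS)
		return (NULL);
	deps[ndeps] = inum;
	return (&deps[ndeps++]);
}
```

Without the recheck, the lookup would append a second entry for an inode
that gained one during the cleanup.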

Reviewed by: Peter Wemm <peter@yahoo-inc.com>,
Matt Dillon <dillon@earth.backplane.com>


72645 18-Feb-2001 asmodai

Preceed/preceeding are not english words. Use precede and preceding.


72521 15-Feb-2001 jlemon

Extend kqueue down to the device layer.

Backwards compatible approach suggested by: peter


72376 12-Feb-2001 jake

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propagated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propagation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired
effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propogate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propagate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at appropriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.


72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarly, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


72012 04-Feb-2001 phk

Another round of the <sys/queue.h> FOREACH transmogriffer.

Created with: sed(1)
Reviewed by: md5(1)


71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


71998 04-Feb-2001 phk

Use <sys/queue.h> macro API.


71993 04-Feb-2001 phk

Remove a DIAGNOSTIC check which belongs in <sys/queue.h> if anyplace at all.


71976 04-Feb-2001 iedowse

Extend the sanity checks in ufs_lookup to ensure that each directory
entry fits within its DIRBLKSIZ block. The surrounding code is
extremely fragile with respect to corruption of the directory entry
'd_reclen' field; if directory corruption occurs, it can blindly
scan forward beyond the end of the filesystem block. Usually this
results in a 'fault on nofault entry' panic.

Directory corruption is now much more likely to be detected, resulting
in a 'ufs_dirbad' panic. If the filesystem is read-only, it will
simply print a warning message, and skip the corrupted block.

Reviewed by: mckusick
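
The kind of bounds check this entry describes can be sketched in userland C.
This is a minimal illustration, not the code in ufs_lookup: the struct
layout, the 512-byte value used for the directory block size, and the
function name dirent_fits_block() are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DIRBLKSIZ 512u   /* assumed directory block size */

/* Hypothetical fixed-size header of an on-disk directory entry. */
struct sketch_dirent_hdr {
    uint32_t d_ino;     /* inode number */
    uint16_t d_reclen;  /* total length of this record */
    uint8_t  d_type;    /* file type */
    uint8_t  d_namlen;  /* length of the name that follows the header */
};

/*
 * Return 1 if an entry starting at byte `offset` with record length
 * `reclen` and name length `namlen` stays inside its directory block
 * and is large enough to hold its own name; return 0 otherwise.
 */
int
dirent_fits_block(size_t offset, uint16_t reclen, uint8_t namlen)
{
    size_t space_left = SKETCH_DIRBLKSIZ - (offset % SKETCH_DIRBLKSIZ);
    size_t min_len = sizeof(struct sketch_dirent_hdr) + namlen + 1;

    if (reclen == 0 || (reclen & 3) != 0)   /* zero or misaligned record */
        return 0;
    if (reclen > space_left)                /* would cross the block end */
        return 0;
    if (reclen < min_len)                   /* too short to hold the name */
        return 0;
    return 1;
}
```

A scanner that rejects an entry as soon as this check fails never advances
by a corrupt d_reclen, so it cannot walk past the end of the block.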


71968 03-Feb-2001 iedowse

Use the correct flags field when checking for a read-only filesystem
in ufs_dirbad(). The mnt_stat.f_flags field is only updated by the
syscalls *statfs and getfsstat, so mnt_flag should be used instead.

This only affects whether or not a panic is generated on detection of
certain types of directory corruption.

Reviewed by: mckusick


71820 30-Jan-2001 dillon

Fix a race between the syncer and umount. When you umount a softupdates
filesystem, softdep_process_worklist() is called in a loop until it indicates
that no dependencies remain, but the determination of that fact depends on
there only being one softdep_process_worklist() instance running. It was
possible for the syncer to also be running softdep_process_worklist()
and the pre-existing checks in the code to prevent this were not sufficient
to prevent the race. This patch solves the problem.

Approved-by: mckusick


71576 24-Jan-2001 jasone

Convert all simplelocks to mutexes and remove the simplelock implementations.


71073 15-Jan-2001 iedowse

The ffs superblock includes a 128-byte region for use by temporary
in-core pointers to summary information. An array in this region
(fs_csp) could overflow on filesystems with a very large number of
cylinder groups (~16000 on i386 with 8k blocks). When this happens,
other fields in the superblock get corrupted, and fsck refuses to
check the filesystem.

Solve this problem by replacing the fs_csp array in 'struct fs'
with a single pointer, and add padding to keep the length of the
128-byte region fixed. Update the kernel and userland utilities
to use just this single pointer.

With this change, the kernel no longer makes use of the superblock
fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c
to indicate that these fields must be calculated for compatibility
with older kernels.

Reviewed by: mckusick


70980 12-Jan-2001 mckusick

Properly compute the size of the final block of superblock summary information.

Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


70776 07-Jan-2001 rwatson

o Commit reams of style(9) changes, whitespace improvements, and comment
cleanups.

Obtained from: TrustedBSD Project


70774 07-Jan-2001 rwatson

o Zero the ufs_extattr_header length field (not necessary, but not a bad
idea either) in ufs_extattr_rm.
o More completely fill out the local_aio structure when writing out the
zero'd extended attribute in ufs_extattr_rm -- previously, this worked
fine, but probably should not have. This corrects extraneous warnings
about inconsistent inodes following file deletion.

Reviewed by: jedgar


70773 07-Jan-2001 rwatson

o Add an additional EA inconsistency reporting opportunity in
ufs_extattr_rm.
o Make both reporting locations report the function name where the
inconsistency is discovered, as well as the inode number in question.

Reviewed by: jedgar


70767 07-Jan-2001 rwatson

o Make call to ufs_extattr_rm() in ufs_extattr_vnode_inactive() use
NULL as the credential, not 0, so as to make it more clear what's
going on.

Obtained from: TrustedBSD Project


70764 07-Jan-2001 rwatson

o Remove unnecessary sanity check involving requested offset of extended
attribute read--the offset is required to be 0 by an earlier check,
meaning that it will always be within the scope of the attribute data.
This change should have no impact on executed code paths other than
removing the unnecessary check: please report if any new failures
start to occur as a result.

Obtained from: TrustedBSD Project


70374 26-Dec-2000 dillon

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then successive passes will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, among other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.
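
The entry does not spell out how the two sysctls interact. One conventional
reading is a high/low watermark scheme: new writers stall once the amount of
write I/O in flight exceeds the high mark, and resume only after it drains
to the low mark. The sketch below assumes that reading; the struct and
function names are hypothetical and nothing here is taken from the kernel
source.

```c
/* Hypothetical accounting state for write I/O in flight. */
struct sk_throttle {
    long running;   /* bytes of write I/O currently in flight */
    long lo;        /* low watermark (cf. vfs.lorunningspace) */
    long hi;        /* high watermark (cf. vfs.hirunningspace) */
    int  blocked;   /* nonzero while new writers must wait */
};

/* Account for a write being issued; returns 1 if writers must now wait. */
int
sk_write_start(struct sk_throttle *t, long bytes)
{
    t->running += bytes;
    if (t->running > t->hi)
        t->blocked = 1;
    return t->blocked;
}

/* Account for a write completing; returns 1 if writers must still wait. */
int
sk_write_done(struct sk_throttle *t, long bytes)
{
    t->running -= bytes;
    if (t->blocked && t->running <= t->lo)
        t->blocked = 0;
    return t->blocked;
}
```

The gap between the two marks is what keeps the queue from oscillating:
writers are not released the instant the total dips below the high mark.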


70183 19-Dec-2000 mckusick

Several small but important fixes for snapshots:

1) Be more tolerant of missing snapshot files by only trying to decrement
their reference count if they are registered as active.

2) Fix for snapshots of filesystems with block sizes larger than 8K
(from Ollivier Robert <roberto@eurocontrol.fr>).

3) Fix to avoid losing last block in snapshot file when calculating blocks
that need to be copied (from Don Coleman <coleman@coleman.org>).


70182 19-Dec-2000 mckusick

Get rid of spurious check in ffs_truncate for i_size == length
which fails to set the modification time on the file. The same
check a few lines later takes the correct action.

Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


70132 17-Dec-2000 assar

add a stub for softdep_slowdown so that it's possible to build the
kernel without SOFTUPDATES


70131 17-Dec-2000 dillon

Avoid a data-consistency race between write() and mmap()
by ensuring that newly allocated blocks are zeroed. The
race can occur even in the case where the write covers
the entire block.

Reported by: Sven Berkvens <sven@berkvens.net>, Marc Olzheim <zlo@zlo.nu>


70011 14-Dec-2000 tanimura

- Move ifs_init() so that it can initialize ifs_inode_hash_mtx.
- s/ffs_inode_hash_lock/ifs_inode_hash_lock/


69974 13-Dec-2000 tanimura

Do not race for the lock of an inode hash.

Reviewed by: jhb


69967 13-Dec-2000 mckusick

Preventing runaway kernel soft updates memory, take three.
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.

Submitted by: Paul Saab <ps@yahoo-inc.com>
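
The flag described in this entry can be illustrated with a small userland
sketch. The types and the function name process_worklist() are hypothetical
stand-ins, not the soft updates data structures: each work item records
whether handling it requires a vnode lock, and a caller that cannot safely
lock vnodes simply skips those items.

```c
#include <stddef.h>

/* Hypothetical work list entry. */
struct work_item {
    int needs_vnode_lock;   /* 1 if handling requires locking a vnode */
    int done;               /* 1 once the item has been handled */
};

/*
 * Handle every pending item the caller is allowed to handle.
 * A caller passing can_lock_vnodes == 0 (a co-opted helper process)
 * skips items that need a vnode lock and leaves them for a caller
 * that passes 1 (the syncer). Returns the number of items handled.
 */
size_t
process_worklist(struct work_item *items, size_t n, int can_lock_vnodes)
{
    size_t handled = 0;

    for (size_t i = 0; i < n; i++) {
        if (items[i].done)
            continue;
        if (items[i].needs_vnode_lock && !can_lock_vnodes)
            continue;
        items[i].done = 1;
        handled++;
    }
    return handled;
}
```

With this split, helpers drain the bulk of the list and only the items that
need a vnode lock remain for the syncer.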


69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


69774 08-Dec-2000 phk

Staticize some malloc M_ instances.


69686 06-Dec-2000 dillon

Add necessary bwillwrite() in writev() entry point.

Deal with excessive dirty buffers when msync() syncs non-contiguous
dirty buffers by checking for the case in UFS *before* checking for
clusterability.


68933 20-Nov-2000 mckusick

More aggressively rate limit the growth of soft dependency structures
in the face of multiple processes doing massive numbers of filesystem
operations. While this patch will work in nearly all situations, there
are still some perverse workloads that can overwhelm the system.
Detecting and handling these perverse workloads will be the subject
of another patch.

Reviewed by: Paul Saab <ps@yahoo-inc.com>
Obtained from: Ethan Solomita <ethan@geocast.com>


68885 18-Nov-2000 dillon

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of caching them or leaving them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often; it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
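
The ~PAGE_MASK cast issue mentioned in this entry is a general C pitfall and
can be shown in isolation. The sketch below assumes a 4096-byte page and a
32-bit int; the SKETCH_ macros and the three function names are invented for
the example and are not the kernel's definitions.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)   /* type int, value 0xfff */

/*
 * Inverting a 32-bit unsigned mask and then widening it zero-extends:
 * the mask becomes 0x00000000fffff000 and the upper 32 bits of the
 * offset are lost.
 */
uint64_t
trunc_page_narrow(uint64_t off)
{
    uint32_t mask = SKETCH_PAGE_MASK;

    return off & ~mask;
}

/*
 * Inverting a signed int mask gives -4096, which sign-extends to
 * 0xfffffffffffff000 when converted to 64 bits. This works, but only
 * because the mask happens to be signed.
 */
uint64_t
trunc_page_signed(uint64_t off)
{
    return off & ~SKETCH_PAGE_MASK;
}

/* Widening first and inverting afterwards is correct for any mask type. */
uint64_t
trunc_page_wide(uint64_t off)
{
    return off & ~(uint64_t)SKETCH_PAGE_MASK;
}
```

The third form does not depend on the signedness of the mask, which is the
point of extending before inverting.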


68715 14-Nov-2000 mckusick

When deleting a file, the ordering of events imposed by soft updates
is to first write the deleted directory entry to disk, second write
the zero'ed inode to disk, and finally to release the freed blocks
and the inode back to the cylinder-group map. As this ordering
requires two disk writes to occur which are normally spaced about
30 seconds apart (except when memory is under duress), it takes
about a minute from the time that a file is deleted until its inode
and data blocks show up in the cylinder-group map for reallocation.
If a file has had only a brief lifetime (less than 30 seconds from
creation to deletion), neither its inode nor its directory entry
may have been written to disk. If its directory entry has not been
written to disk, then we need not wait for that directory block to
be written as the on-disk directory block does not reference the
inode. Similarly, if the allocated inode has never been written to
disk, we do not have to wait for it to be written back either as
its on-disk representation is still zero'ed out. Thus, in the case
of a short lived file, we can simply release the blocks and inode
to the cylinder-group map immediately. As the inode and its blocks
are released immediately, they are immediately available for other
uses. If they are not released for a minute, then other inodes and
blocks must be allocated for short lived files, cluttering up the
vnode and buffer caches. The previous code was a bit too aggressive
in trying to release the blocks and inode back to the cylinder-group
map resulting in their being made available when in fact the inode
on disk had not yet been zero'ed. This patch takes a more conservative
approach to doing the release which avoids doing the release prematurely.


68307 04-Nov-2000 bde

Fixed breakage of mknod() in rev.1.48 of ext2_vnops.c and rev.1.126 of
ufs_vnops.c:

1) i_ino was confused with i_number, so the inode number passed to
VFS_VGET() was usually wrong (usually 0U).
2) ip was dereferenced after vgone() freed it, so the inode number
passed to VFS_VGET() was sometimes not even wrong.

Bug (1) was usually fatal in ext2_mknod(), since ext2fs doesn't have
space for inode 0 on the disk; ino_to_fsba() subtracts 1 from the
inode number, so inode number 0U gives a way out of bounds array
index. Bug(1) was usually harmless in ufs_mknod(); ino_to_fsba()
doesn't subtract 1, and VFS_VGET() reads suitable garbage (all 0's?)
from the disk for the invalid inode number 0U; ufs_mknod() returns
a wrong vnode, but most callers just vput() it; the correct vnode is
eventually obtained by an implicit VFS_VGET() just like it used to be.

Bug (2) usually doesn't happen.


68186 01-Nov-2000 eivind

Give vop_mmap an untimely death. The opportunity to give it a timely
death timed out in 1996.


68003 30-Oct-2000 phk

Add a missing <sys/systm.h>


67893 29-Oct-2000 phk

Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.


67885 29-Oct-2000 phk

Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ing
the offending inline function (BUF_KERNPROC) on it being #included
already.

I'm not sure BUF_KERNPROC() is even the right thing to do or in the
right place or implemented the right way (inline vs normal function).

Remove consequently unneeded #includes of <sys/proc.h>


67882 29-Oct-2000 phk

Remove unneeded #include <sys/proc.h> lines.


67309 19-Oct-2000 rwatson

o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks. In most cases, the VADMIN test
checks to make sure the credential effective uid is the same as the file
owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
appropriate.
o Modify references to uid-based access control operations such that they
now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
vnode: I believe that new invocations of VOP_ACCESS() are always called
with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
and SUIDDIR code.

Reviewed by: eivind
Obtained from: TrustedBSD Project


67106 14-Oct-2000 adrian

Initial commit of IFS - an inode-namespaced FFS. Here is a short
description:

How it works:
--

Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.)
I didn't see the need in duplicating all of sys/ufs/ffs to get this
off the ground.

File creation is done through a special file - 'newfile' . When newfile
is called, the system allocates and returns an inode. Note that newfile
is done in a cloning fashion:

fd = open("newfile", O_CREAT|O_RDWR, 0644);
fstat(fd, &st);

printf("new file is %d\n", (int)st.st_ino);

Once you have created a file, you can open() and unlink() it by its returned
inode number retrieved from the stat call, ie:

fd = open("5", O_RDWR);

The creation permissions depend entirely on whether you have write access to the
root directory of the filesystem.

To get the list of currently allocated inodes, VOP_READDIR has been added
which returns a directory listing of those currently allocated.

--

What this entails:

* patching conf/files and conf/options to include IFS as a new compile
option (and since ifs depends upon FFS, include the FFS routines)

* An entry in i386/conf/NOTES indicating IFS exists and where to go for
an explanation

* Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS
routines require (ffs_mount() and ffs_reload())

* a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS
routines. IFS replaces some of the vfsops, and a handful of vnops -
most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR().
Any other directory operation is marked as invalid.

What this results in:

* an IFS partition's create permissions are controlled by the perm/ownership of
the root mount point, just like a normal directory

* Each inode has perm and ownership too

* IFS does *NOT* mean an FFS partition can be opened per inode. This is a
completely separate filesystem here

* Softupdates doesn't work with IFS, and really I don't think it needs it.
Besides, fsck's are FAST. (Try it :-)

* Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC).
Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against
this particular inode, and unravelling THAT code isn't trivial. Therefore,
useful inodes start at 3.

Enjoy, and feedback is definitely appreciated!


66893 09-Oct-2000 rwatson

o Sanity check was inverted, resulting in a possible spurious panic
during unmount if extended attributes were in use. Correct by removing
an unneeded (and undesirable) '!'.


66886 09-Oct-2000 eivind

Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)


66753 06-Oct-2000 rwatson

o Move initialization of ump from mp to the top of the function so that
it is defined when used in ufs_extattr_uepm_destroy(), fixing a panic
due to a NULL pointer dereference.

Submitted by: Wesley Morgan <morganw@chemicals.tacorp.com>


66617 04-Oct-2000 rwatson

o Add call to ufs_extattr_uepm_destroy() in ffs_unmount() so as to clean
up lock on extattrs.
o Get for free a comment indicating where auto-starting of extended
attributes will eventually occur, as it was in my commit tree also.
No implementation change here, only a comment.


66616 04-Oct-2000 rwatson

o Correct use of lockdestroy() by adding a new ufs_extattr_uepm_destroy()
call, which should be the last thing done to a per-mount extattr
management structure, after ufs_extattr_stop() on the file system.
This currently has the effect only of destroying the per-mount lock
on extended attributes, and clearing appropriate flags.
o Remove inappropriate invocation in ufs_extattr_vnode_inactive().


66615 04-Oct-2000 jasone

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


66355 25-Sep-2000 bp

Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or kept as the first element of the structure
referenced by the v_data pointer (ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by: mckusick


66187 21-Sep-2000 rwatson

o Permit UFS Extended Attributes to be associated with special devices
and FIFOs.

Obtained from: TrustedBSD Project


66041 18-Sep-2000 rwatson

o Disallow privileged processes in jail() from directly accessing
system namespace extended attributes.
o Document privilege/jail() interaction relating to extended
attributes.

Obtained from: TrustedBSD Project


66040 18-Sep-2000 rwatson

o Allow privileged processes in jail() to override sticky bit behavior
on directories.
o Allow privileged processes in jail() to create inodes with the
setgid bit set even if they are not a member of the group denoted
by the file creation gid. This occurs due to inherited gid's from
parent directories on file creation, allowing a user to create a
file with a gid that is not in the creating process's credentials.

Obtained from: TrustedBSD Project


66039 18-Sep-2000 rwatson

o Add a comment clarifying interaction between jail(), privileged processes,
and UFS file flags. Here's what the comment says, for reference:

Privileged processes in jail() are permitted to modify
arbitrary user flags on files, but are not permitted
to modify system flags.

In other words, privilege does allow a process in jail to modify user
flags for objects that the process does not own, but privilege will
not permit the setting of system flags on the file.

Obtained from: TrustedBSD Project


66038 18-Sep-2000 rwatson

o Add missing PRISON_ROOT allowing a privileged process in a jail() to not
remove the setuid/setgid bits by virtue of a change to a file with those
bits set, even if the process doesn't own the file, or isn't a group
member of the file's gid.

Obtained from: TrustedBSD Project


66033 18-Sep-2000 rwatson

o Substitute suser() calls for direct credential checks, which is now
safe as suser() no longer sets ASU.
o Note that in some cases, the PRISON_ROOT flag is used even though no
process structure is passed, to indicate that if a process structure
(and hence jail) was available, it would be ok. In the long run,
the jail identifier should probably be moved to ucred, as the uidinfo
information was.
o Some uid 0 checks remain relating to the quota code, which I'll leave
for another day.

Reviewed by: phk, eivind
Obtained from: TrustedBSD Project


65998 17-Sep-2000 des

Silence a warning.


65973 17-Sep-2000 bp

Add new flag PDIRUNLOCK to the component.cn_flags which should be set by
filesystem lookup() routine if it unlocks parent directory. This flag should
be carefully tracked by filesystems if they want to work properly with nullfs
and other stacked filesystems.

VFS takes advantage of this flag to perform semantically correct usage
of vrele() instead of vput() if the parent directory is already unlocked.

If a filesystem fails to track this flag then the previous code path in VFS
is left unchanged.

Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystems will
be changed after some period of testing.

Reviewed in general by: mckusick, dillon, adrian
Obtained from: NetBSD


65928 16-Sep-2000 phk

Remove a pointless casting of a gid_t to a gid_t.


65779 12-Sep-2000 bp

Add VOP_*VOBJECT vops, because MFS requires explicit vop specification.

Noted by: knu


65768 12-Sep-2000 rwatson

o Variety of extended attribute fixes
- In ufs_extattr_enable(), return EEXIST instead of EOPNOTSUPP
if the caller tries to configure an attribute name that is
already configured
- Throughout, add IO_NODELOCKED to VOP_{READ,WRITE} calls to
indicate lock status of passed vnode. Apparently not a
problem, but worth fixing.
- For all writes, make use of IO_SYNC consistent. Really,
IO_UNIT and combining of VOP_WRITE's should happen, but I
don't have that tested. At least with this, it's
consistent usage. (pointed out by: bde)
- In ufs_extattr_get(), fixed nested locking of backing
vnode (fine due to recursive lock support, but make it
more consistent with other code)
- In ufs_extattr_get(), clean up return code to set uio_resid
more consistently with other pieces of code (worked fine,
this is just a cleanup)
- Fix ufs_extattr_rm(), which was broken--effectively a nop.
- Minor comment and whitespace fixes.

Obtained from: TrustedBSD Project


65721 11-Sep-2000 jhb

Fix a 64-bitism. Use size_t instead of int for 4th argument to copyinstr.

Approved by: rwatson


65595 07-Sep-2000 mckusick

Cannot do MALLOC with M_WAITOK while holding ACQUIRE_LOCK

Obtained from: Ethan Solomita <ethan@geocast.com>


65557 07-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


65377 02-Sep-2000 rwatson

Modify extended attribute protection model to authorize based on
attribute namespace and DAC protection on file:
- Attribute names beginning with '$' are in the system namespace
- The attribute name "$" is reserved
- System namespace attributes may only be read/set by suser()
or by kernel (cred == NULL)
- Other attribute names are in the application namespace
- The attribute name "" is reserved
- Application namespace attributes are protected in the manner
of the target file permission

o Kernel changes
- Add ufs_extattr_valid_attrname() to check whether the requested
attribute "set" or "enable" is appropriate (i.e., non-reserved)
- Modify ufs_extattr_credcheck() to accept target file vnode, not
to take inode uid
- Modify ufs_extattr_credcheck() to check namespace, then enforce
either kernel/suser for system namespace, or vaccess() for
application namespace
o EA backing file format changes
- Remove permission fields from extended attribute backing file
header
- Bump extended attribute backing file header version to 3
o Update extattrctl.c and extattrctl.8
- Remove now deprecated -r and -w arguments to initattr, as
permissions are now implicit
- (unrelated) fix error reporting and unlinking during failed
initattr to remove duplicate/inaccurate error messages, and to
only unlink if the failure wasn't in the backing file open()

Obtained from: TrustedBSD Project
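
The naming rules listed in this entry can be restated as a small classifier.
This is a sketch of the rules as written in the commit message, not the body
of ufs_extattr_valid_attrname(); the enum and the function name
sk_attr_namespace() are hypothetical.

```c
#include <stddef.h>

/* Hypothetical result type for the sketch. */
enum sk_ns {
    SK_NS_INVALID,      /* reserved or missing name */
    SK_NS_SYSTEM,       /* name begins with '$' */
    SK_NS_APPLICATION   /* any other non-empty name */
};

/* Classify an attribute name according to the rules in this entry. */
enum sk_ns
sk_attr_namespace(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return SK_NS_INVALID;           /* "" is reserved */
    if (name[0] == '$') {
        if (name[1] == '\0')
            return SK_NS_INVALID;       /* "$" is reserved */
        return SK_NS_SYSTEM;
    }
    return SK_NS_APPLICATION;
}
```

A credential check would then branch on the result: kernel or suser() for
the system namespace, file permissions for the application namespace.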


65200 29-Aug-2000 rwatson

o Restructure vaccess() so as to check for DAC permission to modify the
object before falling back on privilege. Make vaccess() accept an
additional optional argument, privused, to determine whether
privilege was required for vaccess() to return 0. Add commented
out capability checks for reference. Rename some variables to make
it more clear which modes/uids/etc are associated with the object,
and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
privused argument. Once additional patches are applied, suser()
will no longer set ASU, so privused will permit passing of
privilege information up the stack to the caller.

Reviewed by: bde, green, phk, -security, others
Obtained from: TrustedBSD Project
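
The ordering described in this entry (discretionary permission first,
privilege only as a fallback, with an optional out-parameter reporting
whether privilege was needed) can be sketched as follows. The types, the
-1 error return, and the name sk_vaccess() are assumptions for illustration;
the real vaccess() has a different signature.

```c
#include <stddef.h>

#define SK_VREAD  04
#define SK_VWRITE 02
#define SK_VEXEC  01

/* Hypothetical credential: the real struct ucred is richer. */
struct sk_cred {
    unsigned uid;
    unsigned gid;
    int      is_privileged;  /* stand-in for a successful suser() check */
};

/*
 * Return 0 if `cred` may access the object with `acc_mode`, -1 if not.
 * If privused is non-NULL it is set to 1 only when the request was
 * granted by privilege rather than by the permission bits.
 */
int
sk_vaccess(unsigned file_mode, unsigned file_uid, unsigned file_gid,
    int acc_mode, const struct sk_cred *cred, int *privused)
{
    int dac_granted;

    if (privused != NULL)
        *privused = 0;

    /* Pick the owner, group, or other permission triplet. */
    if (cred->uid == file_uid)
        dac_granted = (file_mode >> 6) & 07;
    else if (cred->gid == file_gid)
        dac_granted = (file_mode >> 3) & 07;
    else
        dac_granted = file_mode & 07;

    /* Discretionary access control is checked first. */
    if ((acc_mode & dac_granted) == acc_mode)
        return 0;

    /* Fall back on privilege only when DAC did not grant access. */
    if (cred->is_privileged) {
        if (privused != NULL)
            *privused = 1;
        return 0;
    }
    return -1;
}
```

Callers that do not care how access was granted pass NULL for privused, as
the file system callers in this entry do.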


65119 26-Aug-2000 rwatson

o Correct spelling of ufs_exttatr_find_attr -> ufs_extattr_find_attr
o Add "const" qualifier to attrname argument of various calls to remove
warnings

Obtained from: TrustedBSD Project


64880 20-Aug-2000 phk

Remove all traces of Julian's DEVFS (incl from kern/subr_diskslice.c)

Remove old DEVFS support fields from dev_t.

Make uid, gid & mode members of dev_t and set them in make_dev().

Use correct uid, gid & mode in make_dev in disk minilayer.

Add support for registering alias names for a dev_t using the
new function make_dev_alias(). These will show up as symlinks
in DEVFS.

Use makedev() rather than make_dev() for MFS's magic devices to prevent
DEVFS from noticing this abuse.

Add a field for DEVFS inode number in dev_t.

Add new DEVFS in fs/devfs.

Add devfs cloning to:
disk minilayer (ie: ad(4), sd(4), cd(4) etc etc)
md(4), tun(4), bpf(4), fd(4)

If DEVFS add -d flag to /sbin/init's args to make it mount devfs.

Add commented out DEVFS to GENERIC


64865 20-Aug-2000 phk

Centralize the canonical vop_access user/group/other check in vaccess().

Discussed with: bde


64437 09-Aug-2000 tegge

Initialize *countp to 0 in stub for softdep_flushworklist().
This allows ffs_fsync() to break out of a loop that might otherwise
be infinite on kernels compiled without the SOFTUPDATES option.
The observed symptom was a system hang at the first unmount attempt.


64104 01-Aug-2000 roberto

Fix the lockmgr panic everyone is seeing at shutdown time.
vput assumes curproc is the lock holder, but it's not true in this case.

Thanks a lot Luoqi !

Submitted by: luoqi
Tested by: phk


63976 28-Jul-2000 peter

Minor tweak - removed unused variable 'struct mount *mp';


63975 28-Jul-2000 peter

Minor change: fix warning - move a 'struct vnode *vp' declaration inside a
#ifdef DIAGNOSTIC to match its corresponding usage.


63897 26-Jul-2000 mckusick

Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by: Robert Watson <rwatson@FreeBSD.org>


63889 26-Jul-2000 phk

Fix the "mfs_badop[vop_getwritemount] = 45" messages.


63829 25-Jul-2000 mckusick

Add stub for softdep_flushworklist() so that kernels compiled
without the SOFTUPDATES option will load correctly.

Obtained from: John Baldwin <jhb@bsdi.com>


63828 25-Jul-2000 mckusick

Eliminate periodic 'mfs_badop[vop_getwritemount] = 45' messages.

Submitted by: Sheldon Hearn <sheldonh@uunet.co.za>


63788 24-Jul-2000 mckusick

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occasionally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


63099 14-Jul-2000 rwatson

o Marius pointed out an unusually inconvenient upper bound on extended
attribute data size.
o Fortunately it turned out to be an unused constant left over from an
earlier implementation, and is therefore being removed so as not to
confuse casual observers.

Submitted by: mbendiks@eunet.no


63059 13-Jul-2000 bp

Prevent possible dereference of NULL pointer.

Submitted by: Marius Bendiksen <mbendiks@eunet.no>


62985 12-Jul-2000 mckusick

Brain fault, forgot to update ffs_snapshot.c with the new calling convention
for vn_start_write.


62976 11-Jul-2000 mckusick

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot, so for brevity and timeliness
they are not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


62968 11-Jul-2000 mckusick

Clean up warning about undeclared function by declaring softdep_fsync
in mount.h instead of ffs_extern.h. The correct solution is to use
an indirect function pointer so that the kernel does not have to be
built with options FFS, but that will be left for another day.


62907 10-Jul-2000 phk

Finish repo-copy:

Move ufs/ufs/ufs_disksubr.c to kern/subr_disklabel.c.

These functions are not UFS specific and are in fact used all over the place.


62799 08-Jul-2000 mckusick

Delete README as it is now obsolete. Relevant information is in
README.softupdates.


62798 08-Jul-2000 mckusick

Update to reflect current status.


62553 04-Jul-2000 mckusick

Get userland visible flags added for snapshots to give a few days
advance preparation for them to get migrated into place so that
subsequent changes in utilities will not fail to compile for lack
of up-to-date header files in /usr/include.


62550 04-Jul-2000 mckusick

Move the truncation code out of vn_open and into the open system call
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.

Obtained from: BSD/OS
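
The ordering described above (take the lock first, truncate only on
success) can be sketched in userland C. This is an illustration, not the
kernel change: it uses separate flock() and ftruncate() calls instead of
open flags, and the helper name open_locked_trunc is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path`, try a non-blocking exclusive
 * advisory lock, and truncate only if the lock was obtained.
 * Returns the descriptor on success, or -1 with errno set.
 */
int
open_locked_trunc(const char *path)
{
    int fd, saved;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0644);    /* deliberately no O_TRUNC */
    if (fd == -1)
        return (-1);
    if (flock(fd, LOCK_EX | LOCK_NB) == -1 || ftruncate(fd, 0) == -1) {
        saved = errno;          /* keep the first failure's errno */
        close(fd);
        errno = saved;
        return (-1);
    }
    return (fd);
}
```

If another descriptor already holds the lock, the call fails and the
file contents are left untouched.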


62469 03-Jul-2000 phk

Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES,
that is way cleaner than using the softupdates_stub stunt, which
should be killed when convenient.

Discussed with: mckusick


62148 27-Jun-2000 phk

Move prtactive to vfs from ufs. It is used all over the place.


62033 24-Jun-2000 ache

Remove obsoleted info about linking from contrib


61926 22-Jun-2000 mckusick

Update to new copyright.


61813 18-Jun-2000 mckusick

When running with quotas enabled on a filesystem using soft updates,
the system would panic when a user's inode quota was exceeded (see
PR 18959 for details). This fixes that problem.

PR: 18959
Submitted by: Jason Godsey <jason@unixguy.fidalgo.net>


61812 18-Jun-2000 mckusick

Some additional performance improvements. When freeing an inode
check to see if it has been committed to disk. If it has never
been written, it can be freed immediately. For short lived files
this change allows the same inode to be reused repeatedly.
Similarly, when upgrading a fragment to a larger size, if it
has never been claimed by an inode on disk, it too can be freed
immediately making it available for reuse often in the next slowly
growing block of the same file.


61730 16-Jun-2000 phk

Revert part of my bioops change which implemented panic(8).


61729 16-Jun-2000 phk

ARGH! I have too many source trees :-(

Fix prototype errors in last commit.


61724 16-Jun-2000 phk

Virtualizes & untangles the bioops operations vector.

Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@


61698 14-Jun-2000 phk

Remove a comment which should never have made it in.


61281 05-Jun-2000 rwatson

o Remove unneeded off_t variable to clean up compile warning

Obtained from: TrustedBSD Project


61237 04-Jun-2000 rwatson

o If FFS_EXTATTR is defined, don't print out an error message on unmount
if an FFS partition returns EOPNOTSUPP, as it just means extended
attributes weren't enabled on that partition. Prevents spurious
warning per-partition at shutdown.


60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


60165 07-May-2000 rwatson

s/ffs_unmonut/ffs_unmount/ in a gratuitous ufs_extattr printf.

Reported by: knu


60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bde's teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.h> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


59913 03-May-2000 rwatson

Don't allow VOP_GETEXTATTR to set uio->uio_offset != 0, as we don't
provide locking over extended attribute operations, requiring that
individual operations be atomic. Allowing non-zero starting offsets
permits applications/etc to put themselves at risk for inconsistent
behavior. As VOP_SETEXTATTR already prohibited non-zero write offsets,
this makes sense.

Suggested by: Andreas Gruenbacher <a.gruenbacher@bestbits.at>


59794 30-Apr-2000 phk

Remove unneeded #include <vm/vm_zone.h>

Generated by: src/tools/tools/kerninclude


59762 29-Apr-2000 phk

s/biowait/bufwait/g

Prodded by: several.


59760 29-Apr-2000 phk

Remove unneeded #include <sys/kernel.h>


59721 28-Apr-2000 mckusick

When files are given to users by root, the quota system failed to
reset their grace timer as their ownership crossed the soft limit
threshold. Thus if they had been over their limit in the past,
they were suddenly penalized as if they had been over their limit
ever since. The fix is to check when root gives away files, that
when the receiving user crosses their soft limit, their grace timer
is reset. See the PR report for a detailed method of reproducing
the bug.

PR: kern/17128
Submitted by: Andre Albsmeier <andre.albsmeier@mchp.siemens.de>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


59483 22-Apr-2000 phk

Convert the magic MFS device to a VCHR.

Detected by: obrien


59400 19-Apr-2000 rwatson

o Introduce an extended attribute backing file header magic number
o Introduce an extended attribute backing file header version number


59391 19-Apr-2000 phk

Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>


59388 19-Apr-2000 rwatson

o Cause attribute data writes to use IO_SYNC since this improves the
chances of consistency with other file/directory meta-data in a
write. In the current set of extended attribute applications,
this does not hurt much. This should be discussed again later when
it comes time to optimize performance of attributes.

o Include an inode generation number in the per-attribute header
information. This allows consistency verification to catch when
a crash occurs, or an inode is recycled while attributes are not
properly configured. For now, an irritating error message is
displayed when an inconsistency occurs. At some point, may introduce
an ``extattrctl check ...'' which catches these before attributes
are enabled. Not today. If you get this message, it means you
somehow managed to get your attribute backing file out of synch
with the file system. When this occurs, attribute not found is
returned (== undefined). Writes will overwrite the value there
correcting the problem. Might want to think about introducing
a new errno or two to handle this kind of situation.

Discussed with: kris


59363 18-Apr-2000 phk

Retire bufqdisksort(), all drivers use bioqdisksort now.


59308 17-Apr-2000 jlemon

Remove unneeded cast.


59289 16-Apr-2000 jlemon

Replace the POLLEXTEND extensions with the kqueue() mechanism.


59268 16-Apr-2000 rwatson

Fix two bugs in extended attribute support for UFS/FFS:

o Put back in {} removed during over-zealous cleanup of gratuitous
debugging output during preparation for the commit. Due to the
missing {}, writes on extended attributes always silently failed.
Doh.

o Don't unlock the target vnode if it's the backing vnode, as we
don't lock the target vnode if it's the backing vnode.


59249 15-Apr-2000 phk

Complete the bio/buf divorce for all code below devfs::strategy

Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.

CCD not converted yet, casts to struct buf (still safe)

atapi-cd casts to struct buf to examine B_PHYS


59241 15-Apr-2000 rwatson

Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes. This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS. Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default). Userland utilities and man pages will be
committed in the next batch. VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS. This will be fixed.

Reviewed by: adrian, bp, freebsd-fs, other unthanked souls
Obtained from: TrustedBSD Project


58942 02-Apr-2000 phk

Clone bio versions of certain bits of infrastructure:
devstat_end_transaction_bio()
bioq_* versions of bufq_* incl bioqdisksort()
the corresponding "buf" versions will disappear when no longer used.

Move b_offset, b_data and b_bcount to struct bio.

Add BIO_FORMAT as a hack for fd.c etc.

We are now largely ready to start converting drivers to use struct
bio instead of struct buf.


58934 02-Apr-2000 phk

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


58909 02-Apr-2000 dillon

Change the write-behind code to take more care when starting
async I/O's. The sequential read heuristic has been extended to
cover writes as well. We continue to call cluster_write() normally,
thus blocks in the file will still be reallocated for large (but still
random) I/O's, but I/O will only be initiated for truly sequential
writes.

This solves a number of annoying situations, especially with DBM (hash
method) writes, and also has the side effect of fixing a number of
(stupid) benchmarks.

Reviewed-by: mckusick


58365 20-Mar-2000 phk

diff, patch and cvs didn't like these three last time around, try again.


58349 20-Mar-2000 phk

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


58345 20-Mar-2000 phk

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch was machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.
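
A minimal sketch of the two points above, using hypothetical constants
(CMD_*, OLD_*) rather than the kernel's definitions: a one-bit-set check
of the kind a command field can be validated with, and the reason a flag
defined as zero invites mistakes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical command values, one bit each; not the kernel's values. */
enum {
    CMD_READ   = 0x1,
    CMD_WRITE  = 0x2,
    CMD_DELETE = 0x4
};

/* True when exactly one bit of `cmd` is set. */
bool
iocmd_is_valid(unsigned int cmd)
{
    return (cmd != 0 && (cmd & (cmd - 1)) == 0);
}

/* Abort on an invalid command, loosely mirroring an enforced invariant. */
void
iocmd_check(unsigned int cmd)
{
    assert(iocmd_is_valid(cmd));
}

/*
 * The old-style pitfall: with a write flag defined as 0, a mask test
 * can never succeed, so "is this a write?" is always false.
 */
#define OLD_READ  0x1
#define OLD_WRITE 0x0

bool
old_is_write(unsigned int flags)
{
    return ((flags & OLD_WRITE) != 0);    /* always false */
}
```

With distinct non-zero values per command, both a missing command (0)
and a mixed one (two bits) are detectable.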


58155 17-Mar-2000 mckusick

Use 64-bit math to calculate if we have hit our freespace limit.
Necessary for coherent results on filesystems bigger than 0.5Tb.
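
The kind of overflow involved can be sketched as follows. The function
names and the percentage-of-blocks formula are assumptions made for
illustration; they are not taken from the FFS sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical reserve calculation: `percent` of `nblocks`.
 * In 32-bit arithmetic the product wraps modulo 2^32 once
 * nblocks * percent exceeds UINT32_MAX.
 */
uint32_t
reserve32(uint32_t nblocks, uint32_t percent)
{
    assert(percent <= 100);
    return (nblocks * percent / 100);
}

/* Same calculation, widened to 64 bits before multiplying. */
uint64_t
reserve64(uint32_t nblocks, uint32_t percent)
{
    assert(percent <= 100);
    return ((uint64_t)nblocks * percent / 100);
}
```

For small inputs the two agree; for 600000000 blocks at 8 percent the
32-bit version returns 5050327 where the 64-bit version returns 48000000.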


58088 15-Mar-2000 mckusick

Bug fixes for currently harmless bugs that could rise to bite
the unwary if the code were called in slightly different ways.

1) In ufs_bmaparray() the code for calculating 'runb' will stop one block
short of the first entry in an indirect block. i.e. if an indirect block
contains N block numbers b[0]..b[N-1] then the code will never check if
b[0] and b[1] are sequential. For reference, compare with the equivalent
code that deals with direct blocks.

2) In ufs_lookup() there is an off-by-one error in the test that checks
if dp->i_diroff is outside the range of the current directory size.
This is completely harmless, since the following while-loop condition
'dp->i_offset < endsearch' is never met, so the code immediately
does a second pass starting at dp->i_offset = 0.

3) Again in ufs_lookup(), the condition in a sanity check is wrong
for directories that are longer than one block. This bug means that
the sanity check is only effective for small directories.

Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
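
The off-by-one in item 1 can be illustrated with a simplified backward
scan. The loop shape, the names and the `stride` parameter are
assumptions for the sketch; the real ufs_bmaparray() code differs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many entries immediately before index `idx` are
 * sequential, where consecutive blocks differ by `stride`.
 * This version also compares b[0] with b[1].
 */
int
run_before(const long *b, int idx, long stride)
{
    int i, run = 0;

    assert(b != NULL && idx >= 0);
    for (i = idx - 1; i >= 0 && b[i] + stride == b[i + 1]; i--)
        run++;
    return (run);
}

/* Same scan with the bound one too tight: b[0] is never examined. */
int
run_before_short(const long *b, int idx, long stride)
{
    int i, run = 0;

    assert(b != NULL && idx >= 0);
    for (i = idx - 1; i > 0 && b[i] + stride == b[i + 1]; i--)
        run++;
    return (run);
}
```

On a fully sequential array of four entries, scanning back from the last
one gives 3 with the first function and 2 with the second.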


58087 15-Mar-2000 mckusick

Use 64-bit math to decide if optimization needs to be changed.
Necessary for coherent results on filesystems bigger than 0.5Tb.

Submitted by: Paul Saab <ps@yahoo-inc.com>


57869 09-Mar-2000 dillon

In the 'found' case for ufs_lookup() the underlying bp's data was
being accessed after the bp had been released. A simple move of the
brelse() solves the problem.

Approved by: jkh
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>


57446 24-Feb-2000 dillon

Fix a 'freeing free block' panic in UFS. The problem occurs when the
filesystem fills up. If the first indirect block exists and FFS is able
to allocate deeper indirect blocks, but is not able to allocate the
data block, FFS improperly unwinds the indirect blocks and leaves a
block pointer hanging to a freed block. This will cause a panic later
when the file is removed. The solution is to properly account for the
first block-pointer-to-an-indirect-block we had to create in a balloc
operation and then unwind it if a failure occurs.

Detective work by: Ian Dowse <iedowse@maths.tcd.ie>
Reviewed by: mckusick, Ian Dowse <iedowse@maths.tcd.ie>
Approved by: jkh


57387 22-Feb-2000 rwatson

After much consulting with bde, concluded that this fix was the best fix
to the current jail/chflags interactions. This fix conditionalizes ``root
behavior'' in the chflags() case on not being in jail, so attempts to
perform a chflags in a jail are limited to what a normal user could do.
For example, this does allow setting of user flags as appropriate, but
prohibits changing of system flags.

Reviewed by: bde


57347 20-Feb-2000 rwatson

Disable chflags() from within jail() so that root within jail can't make
a mess in securelevel environments. Results in one warning during
/etc/rc as it attempts to remove file flags, but this is harmless.

Approved by: High Lord Hubbard


56908 30-Jan-2000 mckusick

When writing out bitmap buffers, need to skip over ones that already
have a write in progress. Otherwise one can get in an infinite loop
trying to get them all flushed.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


56209 18-Jan-2000 mckusick

During fastpath processing for removal of a short-lived inode, the
set of restrictions for cancelling an inode dependency (inodedep)
is somewhat stronger than originally coded. Since this check appears
in two places, we codify it into the function check_inode_unwritten
which we then call from the two sites, one freeing blocks and the
other freeing directory entries.

Submitted by: Steinar Haug via Matthew Dillon


56208 18-Jan-2000 mckusick

Need to reorganize the flushing of directory entry (pagedep) dependencies
so that they never try to lock an inode corresponding to ".." as this
can lead to deadlock. We observe that any inode with an updated link count
is always pushed into its buffer at the time of the link count change, so
we do not need to do a VOP_UPDATE, but merely find its buffer and write it.
The only time we need to get the inode itself is from the result of a
mkdir whose name will never be ".." and hence locking such an inode will
never request a lock above us in the filesystem tree. Thanks to Brian
Fundakowski Feldman for providing the test program that tickled soft updates
into hanging in "inode" sleep.

Submitted by: Brian Fundakowski Feldman <green@FreeBSD.org>


56150 17-Jan-2000 mckusick

Better bounding on softdep_flushfiles; other minor tweaks to checks.


56149 17-Jan-2000 mckusick

Must track multiple uncommitted renames until one ultimately gets
committed to disk or is removed.


55947 14-Jan-2000 dillon

Non-operational change, fix compiler warning.

Reviewed by: mckusick


55931 13-Jan-2000 mckusick

Confirming Peter's fix (locking 101: release the lock before you go
to sleep). Locking 101, part 2: do not look at buffer contents after
you have been asleep. There is no telling what wondrous changes may
have occurred.


55928 13-Jan-2000 peter

Free the global softupdates lock prior to tsleep() in getdirtybuf().
This seems to be responsible for a bunch of panics where the process
sleeps and something else finds softupdates "locked" when it shouldn't
be. This commit is unreviewed, but has been a big help here.
Previously my boxes would panic pretty much on the first fsync() that
wrote something to disk.


55886 13-Jan-2000 mckusick

Because cylinder group blocks are now written in background,
it is no longer sufficient to get a lock on a buffer to know
that its write has been completed. We have to first get the
lock on the buffer, then check to see if it is doing a
background write. If it is doing background write, we have
to wait for the background write to finish, then check to see
if that fulfilled our dependency, and if not to start another
write. Luckily the explanation is longer than the fix.


55885 13-Jan-2000 mckusick

A panic occurs during an fsync when a dirty block associated with
a vnode has not been written (which would clear certain of its
dependencies). The problem arises because fsync with MNT_NOWAIT
no longer pushes all the dirty blocks associated with a vnode. It
skips those that require rollbacks, since they will just get instantly
dirty again. Such skipped blocks are marked so that they will not be
skipped a second time (otherwise circular dependencies would never
clear). So, we fsync twice to ensure that everything will be written
at least once.


55799 11-Jan-2000 mckusick

The only known cause of this panic is running out of disk space.
The problem occurs when an indirect block and a data block are
being allocated at the same time. For example when the 13th block
of the file is written, the filesystem needs to allocate the first
indirect block and a data block. If the indirect block allocation
succeeds, but the data block allocation fails, the error code
deallocates the indirect block as it has nothing at which to point.
Unfortunately, it does not deallocate the indirect block's associated
dependencies which then fail when they find the block unexpectedly
gone (ptr == 0 instead of its expected value). The fix is to fsync
the file before doing the block rollback, as the fsync will flush
out all of the dependencies. Once the rollback is done the file
must be fsync'ed again so that the soft updates code does not find
unexpected changes. This approach is much slower than writing the
code to back out the extraneous dependencies, but running out of
disk space is not expected to be a common occurrence, so just getting
it right is the main criterion.

PR: kern/15063
Submitted by: Assar Westerlund <assar@stacken.kth.se>


55794 11-Jan-2000 mckusick

We cannot proceed to free the blocks of the file until the dependencies
have been cleaned up by deallocate_dependencies(). Once that is done, it
is safe to post the request to free the blocks. A similar change is also
needed for the freefile case.


55756 10-Jan-2000 phk

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


55726 10-Jan-2000 mckusick

Missing FREE_LOCK call before handle_workitem_freeblocks.

Submitted by: "Kenneth D. Merry" <ken@kdm.org>


55697 10-Jan-2000 mckusick

Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for further allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.


55694 09-Jan-2000 mckusick

Keep tighter control of removal dependencies by limiting the number
of dirrem structures rather than the collaterally created freeblks
and freefile structures. Limit the rate of buffer dirtying by the
syncer process during periods of intense file removal.


55692 09-Jan-2000 mckusick

Reorganize softdep_fsync so that it only does the inode-is-flushed
check before the inode is unlocked while grabbing its parent directory.
Once it is unlocked, other operations may slip in that could make
the inode-is-flushed check fail. Allowing other writes to the inode
before returning from fsync does not break the semantics of fsync
since we have flushed everything that was dirty at the time of the
fsync call.


55691 09-Jan-2000 mckusick

Get rid of unreferenced function.


55690 09-Jan-2000 mckusick

Make static non-exported functions from soft updates.


55206 29-Dec-1999 peter

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistent with the other
BSD's who made this change quite some time ago. More commits to come.


55029 23-Dec-1999 bde

Update the unclean flag for mount -u. I forgot to handle this case
when I made the absence of the clean flag sticky in rev.1.88. This
was a problem mainly for "mount /". There is no way to mount "/" for
writing without using mount -u (normally implicitly), so after
"mount -f /" of an unclean filesystem, the absence of the clean flag
was sticky forever.


54952 21-Dec-1999 eivind

Change incorrect NULLs to 0s


54803 19-Dec-1999 rwatson

Second pass commit to introduce new ACL and Extended Attribute system
calls, vnops, vfsops, both in /kern, and to individual file systems that
require a vfsop_ array entry.

Reviewed by: eivind


54700 16-Dec-1999 mckusick

The function request_cleanup() had a tsleep() with PCATCH. It is
quite dangerous, since the process may hold locks at that point,
and if it is stopped in that tsleep the machine may hang. Because
the sleep is so short, the PCATCH is not required here, so it has
been removed. For the future, the FreeBSD team needs to decide
whether it is still reasonable to stop a process in tsleep, as that
may affect any other code that uses PCATCH while holding kernel locks.

Submitted by: Dmitrij Tejblum <tejblum@arc.hq.cti.ru>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


54655 15-Dec-1999 eivind

Introduce NDFREE (and remove VOP_ABORTOP)


54444 11-Dec-1999 eivind

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() get a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


54049 03-Dec-1999 billf

Remove the 'alpha, use at your own risk' death-statement.

Reviewed by: mckusick (verbally at FreeBSDcon)


54048 03-Dec-1999 billf

Fix typo, add $FreeBSD$


53996 01-Dec-1999 mckusick

Preferentially allocate the first indirect block in the same
cylinder group as the inode. This makes a 15% difference in
read speed for files in the 96K to 500K size range.


53722 26-Nov-1999 phk

Retire MFS_ROOT and MFS_ROOT_SIZE options from the MFS implementation.

Add MD_ROOT and MD_ROOT_SIZE options to the md driver.

Make the md driver handle MFS_ROOT and MFS_ROOT_SIZE options for compatibility.

Add md driver to GENERIC, PCCARD and LINT.

This is a cleanup which removes the need for some of the worse hacks in
MFS: We really want to have a rootvnode but MFS on a preloaded image
doesn't really have one. md is a true device, so it is less trouble.

This has been tested with make release, and if people remember to add
the "md" pseudo-device to their kernels, PicoBSD should be just fine
as well. If people have no other use for MFS, it can be removed from
the kernel.


53577 22-Nov-1999 phk

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


53464 20-Nov-1999 eivind

We do not have ffs_checkexp, so remove the prototype


53452 20-Nov-1999 phk

struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly mp != (void*)&mountlist comparisons.

Requested by: phk
Submitted by: Jake Burkholder jake@checker.org
PR: 14967


53360 18-Nov-1999 peter

Fix a warning (unused static declaration without MFS_ROOT)


53131 13-Nov-1999 eivind

Remove WILLRELE from VOP_SYMLINK

Note: Previous commit to these files (except coda_vnops and devfs_vnops)
that claimed to remove WILLRELE from VOP_RENAME actually removed it from
VOP_MKNOD.


53101 12-Nov-1999 eivind

Remove WILLRELE from VOP_RENAME


53059 09-Nov-1999 phk

Next step in the device cleanup process.

Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.

Unify spec_open() for bdev and cdev cases.

Remove the disabled bdev specific read/write code.


52838 03-Nov-1999 bde

Quick fix for breakage of ext2fs link counts as reported by stat(2) by
the soft updates changes: only report the link count to be i_effnlink
in ufs_getattr() for file systems that maintain i_effnlink.

Tested by: Mike Dracopoulos <mdraco@math.uoa.gr>


52836 03-Nov-1999 msmith

Make MFS work with the new root filesystem search process.

In order to achieve this, root filesystem mount is moved from
SI_ORDER_FIRST to SI_ORDER_SECOND in the SI_SUB_MOUNT_ROOT sysinit
group. Now, modules which wish to usurp the default root mount
can use SI_ORDER_FIRST.

A compiled-in or preloaded MFS filesystem will become the root
filesystem unless the vfs.root.mountfrom environment variable refers
to a valid bootable device. This will normally only be the case when
the kernel and MFS image have been loaded from a disk which has a
valid /etc/fstab file. In this case, the variable should be manually
overridden in the loader, or the kernel booted with -a. In either
case "mfs:" should be supplied as the new value.

Also fix a typo in one DFLTROOT case that would not have compiled.


52782 01-Nov-1999 msmith

Newline-terminate the complaint message about not being able to find
the root vnode pointer.


52641 30-Oct-1999 dillon

Add sysctl debug.dircheck to allow directory sanity checking to be turned
on with a sysctl.

Fix two bugs in ufs_lookup that can cause deadlocks due to out-of-order
locking. This fix was tested for a few days prior to commit.


52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


51808 30-Sep-1999 phk

Remove the D_NOCLUSTER[RW] options which were added because vn had
problems. Now that Matt has fixed vn, this can go. The vn driver
should have used d_maxio (now si_iosize_max) anyway.


51797 29-Sep-1999 phk

Remove v_maxio from struct vnode.

Replace it with mnt_iosize_max in struct mount.

Nits from: bde


51791 29-Sep-1999 marcel

sigset_t change (part 2 of 5)
-----------------------------

The core of the signalling code has been rewritten to operate
on the new sigset_t. No methodological changes have been made.
Most references to a sigset_t object are through macros (see
signalvar.h) to create a level of abstraction and to provide
a basis for further improvements.

The NSIG constant has not been changed to reflect the maximum
number of signals possible. The reason is that it breaks
programs (especially shells) which assume that all signals
have a non-null name in sys_signame. See src/bin/sh/trap.c
for an example. Instead _SIG_MAXSIG has been introduced to
hold the maximum signal possible with the new sigset_t.

struct sigprop has been moved from signalvar.h to kern_sig.c
because a) it is only used there, and b) access must be done
through the function sigprop(). The latter is because the table doesn't
hold properties for all signals, but only for the first NSIG
signals.

signal.h has been reorganized to make reading easier and to
add the new and/or modified structures. The "old" structures
are moved to signalvar.h to prevent namespace pollution.

Especially the coda filesystem suffers from the change, because
it contained lines like (p->p_sigmask == SIGIO), which is easy
to do for integral types, but not for compound types.

NOTE: kdump (and port linux_kdump) must be recompiled.

Thanks to Garrett Wollman and Daniel Eischen for pressing the
importance of changing sigreturn as well.


51658 25-Sep-1999 phk

Remove five now unused fields from struct cdevsw. They should never
have been there in the first place. A GENERIC kernel shrinks almost 1k.

Add a slightly different safetybelt under nostop for tty drivers.

Add some missing FreeBSD tags


51486 20-Sep-1999 dillon

More removals of vnode->v_lastr, replaced by preexisting seqcount
heuristic to detect sequential operation.

VM-related forced clustering code removed from ufs in preparation for a
commit to vm/vm_fault.c that does it more generally.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>


51483 20-Sep-1999 phk

Fix a harmless bug I introduced, simplify a bit more while here.


51479 20-Sep-1999 phk

Step one of replacing devsw->d_maxio with si_bsize_max.

Rename dev->si_bsize_max to si_iosize_max and set it in spec_open
if the device didn't.

Set vp->v_maxio from dev->si_bsize_max in spec_open rather than
in ufs_bmap.c


51226 13-Sep-1999 bde

Removed diskerr()'s unused d_name arg and updated callers. This fixes
warnings caused by the arg having the wrong type (not const enough).
The arg was also wrong (a full name instead of a short one) for calls
from subr_diskmbr.c and pc98/diskslice_machdep.c.


51138 11-Sep-1999 alfred

Separate the export check from VFS_FHTOVP; exports are now checked via
VFS_CHECKEXP.

Add fh(open|stat|statfs) syscalls to allow userland to query filesystems
based on (network) filehandle.

Obtained from: NetBSD


51111 09-Sep-1999 julian

Changes to centralise the default blocksize behaviour.
More likely to follow.

Submitted by: phk@freebsd.org


50830 03-Sep-1999 julian

Revert a bunch of controversial changes by PHK. After
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.

The reversion was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.


50623 30-Aug-1999 phk

Make bdev userland access work like cdev userland access unless
the highly non-recommended option ALLOW_BDEV_ACCESS is used.

(bdev access is evil because you don't get write errors reported.)

Kill si_bsize_best before it kills Matt :-)

Use the specfs routines rather than having cloned copies in devfs.


50521 28-Aug-1999 phk

remove unused variables.


50511 28-Aug-1999 phk

We don't need to pass the diskname argument all over the diskslice/label
code, we can find the name from any convenient dev_t


50480 28-Aug-1999 peter

$Id$ -> $FreeBSD$


50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


50405 26-Aug-1999 phk

Simplify the handling of VCHR and VBLK vnodes using the new dev_t:

Make the alias list a SLIST.

Drop the "fast recycling" optimization of vnodes (including
the returning of a preexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.

Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.

Make the revoke syscalls use vcount() instead of VALIASED.

Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.

vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED;


50347 25-Aug-1999 phk

Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.


50316 24-Aug-1999 phk

Initialize the si_bsize fields for the MFS bogodevices.

(This broke MFS rootfs and thereby installation)


50305 24-Aug-1999 sheldonh

Fix bug introduced in rev 1.28, which causes kernel build to break for
the case where DEBUG is defined but not DIAGNOSTIC. ffs_checkblk is
declared conditionally on DIAGNOSTIC, not DEBUG.

PR: 13314
Reviewed by: bde


50253 23-Aug-1999 bde

Use devtoname() to print dev_t's instead of casting them to long or u_long
for misprinting in %lx format.


50137 22-Aug-1999 jdp

Support full-precision file timestamps. Until now, only the seconds
have been maintained, and that is still the default. A new sysctl
variable "vfs.timestamp_precision" can be used to enable higher
levels of precision:

0 = seconds only; nanoseconds zeroed (default).
1 = seconds and nanoseconds, accurate within 1/HZ.
2 = seconds and nanoseconds, truncated to microseconds.
>=3 = seconds and nanoseconds, maximum precision.

Level 1 uses getnanotime(), which is fast but can be wrong by up
to 1/HZ. Level 2 uses microtime(). It might be desirable for
consistency with utimes() and friends, which take timeval structures
rather than timespecs. Level 3 uses nanotime() for the highest
precision.
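The effect of the four levels on a stored timestamp can be sketched as
below. This is a userland illustration only: sketch_stamp() and its hz
parameter are hypothetical, and level 1 is modelled by rounding down to
a tick boundary, which merely approximates what getnanotime() returns.

```c
#include <time.h>

/*
 * Reduce a full-precision timestamp to what each assumed
 * vfs.timestamp_precision level would record.
 */
static struct timespec
sketch_stamp(int precision, struct timespec now, long hz)
{
	struct timespec ts = now;
	long tick_ns;

	switch (precision) {
	case 0:		/* seconds only; nanoseconds zeroed */
		ts.tv_nsec = 0;
		break;
	case 1:		/* modelled as "good to 1/HZ": drop the sub-tick part */
		tick_ns = 1000000000L / hz;
		ts.tv_nsec -= ts.tv_nsec % tick_ns;
		break;
	case 2:		/* truncated to microseconds */
		ts.tv_nsec -= ts.tv_nsec % 1000;
		break;
	default:	/* any other value: keep everything */
		break;
	}
	return (ts);
}
```

With hz = 100, an input of 10.123456789 s would be stored as
10.000000000 at level 0, 10.120000000 at level 1, 10.123456000 at
level 2 and unchanged at level 3.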

I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with
"cpio -pdu". There was almost negligible difference in the system
times -- much less than 1%, and less than the variation among
multiple runs at the same level. Bruce Evans dreamed up a torture
test involving 1-byte reads with intervening fstat() calls, but
the cpio test seems more realistic to me.

This feature is currently implemented only for the UFS (FFS and
MFS) filesystems. But I think it should be easy to support it in
the others as well.

An earlier version of this was reviewed by Bruce. He's not to
blame for any breakage I've introduced since then.

Reviewed by: bde (an earlier version of the code)


49945 17-Aug-1999 alc

Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by: dillon


49771 14-Aug-1999 phk

Spring cleaning around strategy and disklabels/slices:

Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
Please see comment in sys/conf.h about the flag argument.

Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.

Remove bogus and unused strategy1 routines.

Remove open/close arguments from dssize(). Pick them up from dev_t.

Remove unused and unfinished setgeom support from diskslice/label/bad144 code.


49682 13-Aug-1999 phk

Move the special-casing of stat(2)->st_blksize for device files
from UFS to the generic level. For chr/blk devices we don't care
about the blocksize of the filesystem, we want what the device
asked for.


49679 13-Aug-1999 phk

The bdevsw() and cdevsw() are now identical, so kill the former.


49678 13-Aug-1999 phk

s/v_specinfo/v_rdev/


49535 08-Aug-1999 phk

Decommission miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


49338 01-Aug-1999 alc

Move the memory access behavior information provided by madvise
from the vm_object to the vm_map.

Submitted by: dillon


49073 25-Jul-1999 bde

Fixed access timestamp bugs:

Set IN_ACCESS for successful reads of 0 bytes (except for requests to
read 0 bytes). This was broken in rev.1.42.
PR: misc/10148

Don't set IN_ACCESS for requests to read 0 bytes.

Don't set IN_ACCESS for unsuccessful reads.
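The three rules above collapse into one predicate. The sketch below is
only an illustration; sketch_mark_access() is a hypothetical helper, not
code from ufs_readwrite.c.

```c
#include <stddef.h>

/*
 * Decide whether a read should mark the inode for an access-time
 * update: the caller must have asked for at least one byte and the
 * read must have succeeded.  The number of bytes actually transferred
 * does not matter, so a successful read at EOF still counts.
 */
static int
sketch_mark_access(size_t requested, int error)
{
	return (requested != 0 && error == 0);
}
```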


48936 20-Jul-1999 phk

Now a dev_t is a pointer to struct specinfo which is shared by all specdev
vnodes referencing this device.

Details:
cdevsw->d_parms has been removed, the specinfo is available
now (== dev_t) and the driver should modify it directly
when applicable, and the only driver doing so, does so:
vn.c. I am not sure the logic in checking for "<" was right
before, and it looks even less so now.

An initial pool of 50 struct specinfo are depleted during
early boot, after that malloc had better work. It is
likely that fewer than 50 would do.

Hashing is done from udev_t to dev_t with a prime number
remainder hash, experiments show no better hash available
for decent cost (MD5 is only marginally better) The prime
number used should not be close to a power of two, we use
83 for now.
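A prime number remainder hash of the kind described here can be
sketched as follows. The names are hypothetical and only the modulus 83
is taken from the text; the real lookup code is not reproduced.

```c
#include <stdint.h>

#define SKETCH_DEVT_HASH 83	/* prime, not close to a power of two */

/* Map a major|minor bitmap value to a hash bucket index in [0, 82]. */
static unsigned
sketch_devt_hash(uint32_t udev)
{
	return (udev % SKETCH_DEVT_HASH);
}
```

Because 83 shares no factor with 256, devices that differ only in the
major number (a multiple of 256 apart under the traditional layout) end
up spread over different buckets.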

Add new checkalias2() to get around the loss of info from
dev2udev() in bdevvp();

The aliased vnodes are hung on a list straight off the dev_t,
and speclisth[SPECSZ] is unused. The sharing of struct
specinfo means that the v_specnext moves into the vnode
which grows by 4 bytes.

Don't use a VBLK dev_t which doesn't make sense in MFS, now
we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.

Storage overhead from all of this is O(50k).

Bump __FreeBSD_version to 400009

The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo


48859 17-Jul-1999 phk

I have not one single time remembered the name of this function correctly
so obviously I gave it the wrong name. s/umakedev/makeudev/g


48801 13-Jul-1999 mckusick

Create the macro DOINGASYNC to check whether the MNT_ASYNC flag has
been set for a mount point. Insert missing checks to ensure that all
write operations are done asynchronously when the MNT_ASYNC option
has been requested.
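A macro of this shape could look like the sketch below. The structures
are cut-down mocks, the flag value is an assumption for the example,
and sketch_pick_write() is a hypothetical caller showing where such a
check would sit.

```c
#define SKETCH_MNT_ASYNC 0x00000040	/* assumed flag value for the sketch */

/* Mock structures carrying only the fields the macro touches. */
struct sketch_mount { int mnt_flag; };
struct sketch_vnode { struct sketch_mount *v_mount; };

/* Nonzero when the vnode's mount point was mounted async. */
#define SKETCH_DOINGASYNC(vp) \
	(((vp)->v_mount->mnt_flag & SKETCH_MNT_ASYNC) != 0)

enum sketch_write { SKETCH_WRITE_SYNC, SKETCH_WRITE_DELAYED };

/* Hypothetical caller: use a delayed write only on async mounts. */
static enum sketch_write
sketch_pick_write(struct sketch_vnode *vp)
{
	return (SKETCH_DOINGASYNC(vp) ?
	    SKETCH_WRITE_DELAYED : SKETCH_WRITE_SYNC);
}
```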

Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


48759 11-Jul-1999 phk

Use the fsid from the superblock, unless it looks bogus or has already
been taken by some other filesystem.


48677 08-Jul-1999 mckusick

These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations. I was able to run 'dbench 32'
on a 32MB system without bringing the machine to a grinding halt.

* buffer cache hash table now dynamically allocated. This will
have no effect on memory consumption for smaller systems and
will help scale the buffer cache for larger systems.

* minor enhancement to pmap_clearbit(). I noticed that
all the calls to it used constant arguments. Making
it an inline allows the constants to propagate to
deeper inlines and should produce better code.

* removal of inherent vfs_ioopt support through the emplacement
of appropriate #ifdef's, with John's permission. If we do not
find a use for it by the end of the year we will remove it entirely.

* removal of getnewbufloops* counters & sysctl's - no longer
necessary for debugging, getnewbuf() is now optimal.

* buffer hash table functions removed from sys/buf.h and localized
to vfs_bio.c

* VFS_BIO_NEED_DIRTYFLUSH flag and support code added
( bwillwrite() ), allowing processes to block when too many dirty
buffers are present in the system.

* removal of a softdep test in bdwrite() that is no longer necessary
now that bdwrite() no longer attempts to flush dirty buffers.

* slight optimization added to bqrelse() - there is no reason
to test for available buffer space on B_DELWRI buffers.

* addition of reverse-scanning code to vfs_bio_awrite().
vfs_bio_awrite() will attempt to locate clusterable areas
in both the forward and reverse direction relative to the
offset of the buffer passed to it. This will probably not
make much of a difference now, but I believe we will start
to rely on it heavily in the future if we decide to shift
some of the burden of the clustering closer to the actual
I/O initiation.

* Removal of the newbufcnt and lastnewbuf counters that Kirk
added. They do not fix any race conditions that haven't already
been fixed by the gbincore() test done after the only call
to getnewbuf(). getnewbuf() is a static, so there is no chance
of it being misused by other modules. ( Unless Kirk can think
of a specific thing that this code fixes. I went through it
very carefully and didn't see anything ).

* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not
think this check is necessary, the buffer should flush properly
whether the vnode is locked or not. ( yes? ).

* removal of extra arguments passed to getnewbuf() that are not
necessary.

* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
vfs_cluster.c

* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
which should greatly aid flushing operations in heavy load
situations - both the pageout and update daemons will be able
to operate more efficiently.

* removal of b_usecount. We may add it back in later but for now
it is useless. Prior implementations of the buffer cache never
had enough buffers for it to be useful, and current implementations
which make more buffers available might not benefit relative to
the amount of sophistication required to implement a b_usecount.
Straight LRU should work just as well, especially when most things
are VMIO backed. I expect that (even though John will not like
this assumption) directories will become VMIO backed some point soon.

Submitted by: Matthew Dillon <dillon@backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


48656 07-Jul-1999 roberto

Add $Id$

Approved by: kirk


48536 03-Jul-1999 jdp

Update pathnames for new location of soft-updates sources.


48334 29-Jun-1999 mckusick

No longer need to set B_ASYNC flag since BUF_KERNPROC now
unconditionally sets the identity of the buffer.


48276 27-Jun-1999 peter

Keep the inlines for <sys/buf.h> happy..


48225 26-Jun-1999 mckusick

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


47995 18-Jun-1999 mckusick

On our final pass through ffs_fsync, do all I/O synchronously so that
we can find out if our flush is failing because of write errors. This
change avoids a "flush failed" panic during unrecoverable disk errors.


47964 16-Jun-1999 mckusick

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


47940 15-Jun-1999 mckusick

Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.


47640 31-May-1999 phk

Simplify cdevsw registration.

The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print a message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.


47443 24-May-1999 jb

- Back out Luoqi's cdevsw stuff. It panics on my system and is not required.
- Fix an error message.
- Do the MFS_ROOT setting of mountrootfsname in mfs_init() instead of
cpu_rootconf().
- Set rootdev in mfs_init instead of later in mfs_mount() iff MFS_ROOT.


47381 22-May-1999 julian

Cosmetic changes to make it compile without errors in gcc -Wall


47202 14-May-1999 luoqi

Legally acquire a major number for mfs.


47131 14-May-1999 mckusick

Add a hook to ffs_fsync to allow soft updates to get first chance at doing
a sync on the block device for the filesystem. That allows it to push the
bitmap blocks before the inode blocks which greatly reduces the number of
inode rollbacks that need to be done.


47085 12-May-1999 peter

Try and fix a dev_t/major/minor etc nit.


47028 11-May-1999 phk

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.
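For readers unfamiliar with the "major|minor" bitmap, the udev_t side
can be sketched as plain bit manipulation. The layout below (major in
bits 8 to 15, minor in the remaining bits) is assumed for illustration,
and the sketch_ names are hypothetical stand-ins for umajor(), uminor()
and umakedev().

```c
#include <stdint.h>

typedef uint32_t sketch_udev_t;	/* stand-in for the bitmap type */

/* Extract the assumed 8-bit major number from bits 8..15. */
static unsigned
sketch_umajor(sketch_udev_t d)
{
	return ((d >> 8) & 0xff);
}

/* Everything outside bits 8..15 is treated as the minor number. */
static uint32_t
sketch_uminor(sketch_udev_t d)
{
	return (d & 0xffff00ffu);
}

/* Combine a major and a minor number into one bitmap value. */
static sketch_udev_t
sketch_umakedev(unsigned maj, uint32_t min)
{
	return (((sketch_udev_t)maj << 8) | min);
}
```

The kernel dev_t, by contrast, is meant to stop being such a bitmap,
which is why separate conversion functions are listed.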

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to an actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few housekeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I believe this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


46956 11-May-1999 bde

Fixed disordering in previous 2 commits.


46915 10-May-1999 peter

Move the mfs_getimage() prototype to mfs_extern.h instead of duplicating it
everywhere.


46827 09-May-1999 mckusick

Put back changes that might be causing trouble on Alpha.


46676 08-May-1999 phk

I got tired of seeing all the cdevsw[major(foo)] all over the place.

Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.


46635 07-May-1999 phk

Continue where Julian left off in July 1998:

Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.

Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)

Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)

(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)


46618 07-May-1999 mckusick

Whitespace cleanup.


46616 07-May-1999 mckusick

Get rid of random debugging cruft; sync up with latest version.


46609 07-May-1999 mckusick

Severe slowdowns have been reported when creating or removing many
files at once on a filesystem running soft updates. The root of
the problem is that soft updates limits the amount of memory that
may be allocated to dependency structures so as to avoid hogging
kernel memory. The original algorithm just waited for the disk I/O
to catch up and reduce the number of dependencies. This new code
takes a much more aggressive approach. Basically there are two
resources that routinely hit the limit. Inode dependencies during
periods with a high file creation rate and file and block removal
dependencies during periods with a high file removal rate. I have
attacked these problems from two fronts. When the inode dependency
limits are reached, I pick a random inode dependency, UFS_UPDATE
it together with all the other dirty inodes contained within its
disk block and then write that disk block. This trick usually
clears 5-50 inode dependencies in a single disk I/O. For block and
file removal dependencies, I pick a random directory page that has
at least one remove pending and VOP_FSYNC its directory. That
releases all its removal dependencies to the work queue. To further
hasten things along, I also immediately start the work queue process
rather than waiting for its next one second scheduled run.


46568 06-May-1999 peter

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


46349 02-May-1999 alc

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistency issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vice versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


46155 28-Apr-1999 phk

This implements the mumbled-about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no provisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


46124 27-Apr-1999 msmith

Simplify the tunefs example, since tunefs uses getfsfile(). Lots of
people complain about working out what device their filesystems are
mounted on.


46112 27-Apr-1999 phk

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.
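The three steps amount to keeping the old two-argument check under a
new name and adding a one-argument wrapper. The sketch below uses mock
structures; the body of sketch_suser_xxx() follows the traditional BSD
check and is an assumption, not a copy of the committed code.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_ASU 0x02	/* assumed "used superuser privilege" flag */

/* Mock credential and process structures with only the needed fields. */
struct sketch_ucred { unsigned cr_uid; };
struct sketch_proc {
	struct sketch_ucred *p_ucred;
	unsigned short p_acflag;
};

/* Step 1: the old two-argument interface, kept under a new name. */
static int
sketch_suser_xxx(struct sketch_ucred *cred, unsigned short *acflag)
{
	if (cred->cr_uid == 0) {
		if (acflag != NULL)
			*acflag |= SKETCH_ASU;
		return (0);
	}
	return (EPERM);
}

/* Step 2: the new interface takes just the process. */
static int
sketch_suser(struct sketch_proc *p)
{
	return (sketch_suser_xxx(p->p_ucred, &p->p_acflag));
}
```

Step 3 is then the mechanical rewrite of every call that already passed
the same process's p_ucred and p_acflag.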

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


45911 21-Apr-1999 dt

Change type of a variable from u_int to size_t, so that pointer to it may be
used as a last argument to copyinstr().


45570 11-Apr-1999 eivind

Correct typo in panic message


45362 06-Apr-1999 peter

Hold the mfs process's upages in-core with PHOLD rather than P_NOSWAP.


45347 05-Apr-1999 julian

Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimited by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>


45332 05-Apr-1999 peter

There's not much point in the EXPORTMFS #ifdef. I've had this sitting
in my tree for 12+ months, and I just noticed that NetBSD have (I think,
I've just seen the commit, not the change) just zapped it there.
It wasn't in the options files or LINT either.


44675 12-Mar-1999 julian

Stop the mfs from trying to swap out crucial bits of the mfs
as this can lead to deadlock.
Submitted by: Matt Dillon <dillon@freebsd.org>


44512 06-Mar-1999 bde

Don't depend on <ufs/ufs/quota.h> or another (old) prerequisite including
<sys/queue.h>. This fixes my recent breakage of biosboot by unpolluting
<ufs/ufs/quota.h> in the !KERNEL case.


44480 05-Mar-1999 bde

Moved kernel declarations inside the KERNEL ifdef, and removed
include of <sys/queue.h> in the !KERNEL case. The prerequisites
for <ufs/ufs/quota.h> were broken in Lite2 by converting some of
the kernel declarations to use queue macros without including
<sys/queue.h>. <sys/queue.h> was included in applications in
/usr/src instead. We polluted this file instead of merging the
changes in the applications.

Include <sys/queue.h> in the KERNEL case, and forward-declare all
structs that are used in prototypes, so that this file is almost
self-sufficient even in the kernel.

Obtained from: mostly from NetBSD


44474 05-Mar-1999 bde

Changed the type of quotactl()'s 4th arg from `char *' to `void *'
so that non-sloppy applications can call it without using disgusting
casts to avoid warnings. The 4th arg is sort of varargs -- it must
sometimes represent a filename, sometimes a struct pointer, and is
sometimes unused. The arg type is still caddr_t in the kernel.

Obtained from: mostly from NetBSD


44398 02-Mar-1999 mckusick

Reorganize locking to avoid holding the lock during calls to bdwrite
and brelse (which may sleep in some systems).

Obtained from: Matthew Dillon <dillon@apollo.backplane.com>


44395 02-Mar-1999 imp

Merge patch to ufs_vnops.c's ufs_rename to the copy of ufs_rename that
lives in ext2_vnops.c for ext2fs. Also remove cast from comparison.
Bruce pointed out that it was bogus since we'd force a signed
comparison when we really wanted an unsigned comparison.


44391 02-Mar-1999 mckusick

When fsync'ing a file on a filesystem using soft updates, we first try
to write all the dirty blocks. If some of those blocks have dependencies,
they will be remarked dirty when the I/O completes. On systems with
really fast I/O systems, it is possible to get in an infinite loop trying
to flush the buffers, because the I/O finishes before we can get all the
dirty buffers off the v_dirtyblkhd list and into the I/O queue. (The
previous algorithm looped over the v_dirtyblkhd list writing out buffers
until the list emptied.) So, now we mark each buffer that we try to
write so that we can distinguish the ones that are being remarked dirty
from those that we have not yet tried to flush. Once we have tried to
push every buffer once, we then push any associated metadata that is
causing the remaining buffers to be redirtied.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


44383 02-Mar-1999 mckusick

Ensure that softdep_sync_metadata can handle bmsafemap and mkdir entries
if they ever arise (which should not happen as softdep_sync_metadata is
currently used).


44291 26-Feb-1999 imp

Fix last commit based on feedback from Guido, Bruce and Terry.

Specifically, the test was in the wrong place, lacked a cast, didn't
unlock the node, and exited to bad rather than abortit. Now we don't
allow renaming of a file with LINK_MAX references. Move the test to
earlier in the code as it is closer to where ip is obtained, as that
is the style of the rest of the function.
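The guard described above can be sketched as an early check on the
source inode. The structure, the limit value and the function name are
hypothetical; the real test lives inside ufs_rename() and also has to
unlock the node on the error path, which is not modelled here.

```c
#include <errno.h>

#define SKETCH_LINK_MAX 32767	/* assumed limit for the sketch */

/* Mock inode with only the link count. */
struct sketch_inode { short i_nlink; };

/*
 * Refuse to start a rename when the link count is already at the
 * limit, since the rename bumps it temporarily.
 */
static int
sketch_rename_link_check(const struct sketch_inode *ip)
{
	if (ip->i_nlink >= SKETCH_LINK_MAX)
		return (EMLINK);
	return (0);
}
```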

Didn't fix the problems Bruce pointed out in the rename man page to
include EMLINK, nor address his complaints about how the whole idea of
incrementing the link count during a rename is potentially asking for
trouble.

Also didn't try to correct potential problem Terry pointed out with
decrements not being similarly protected against underflow.


44253 25-Feb-1999 imp

Add missing check for LINK_MAX in ufs_rename. Since ip->i_effnlink and
ip->nlink were different types, there was a masked overflow.

Reported by: Marc Slemko <marcs@znep.com>


44248 25-Feb-1999 dillon

Update ufs_vnops code to use new specinfo fields rather than guess.
This is part of general specinfo / d_parms() commit.


44102 17-Feb-1999 mckusick

fix double LIST_REMOVE; other cosmetic changes to match version 9.32.
Obtained from: Jeffrey Hsu <hsu@FreeBSD.ORG>


43958 13-Feb-1999 dillon

Remove XXX comment in regards to why NFS doesn't use VOP_ABORT(). NFS
is being fixed now.


43311 28-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


43287 27-Jan-1999 dillon

Remove unintended trigraph sequences in comments for -Wall


43044 22-Jan-1999 dg

Gutted softdep_deallocate_dependencies and replaced it with a panic. It
turns out to not be useful to unwind the dependencies and continue in
the face of a fatal error.
Also changed the log() to a printf() in softdep_error() so that it will
be output in the case of an impending panic.
Submitted by: Kirk McKusick <mckusick@mckusick.com>


42965 21-Jan-1999 dillon

Added support for VOP_FREEBLKS(), reducing MFS's impact on swap and
increasing performance by deallocating at least some of the backing
store when files are removed.

Protect mfsp->buf_queue access at splbio().


42964 21-Jan-1999 dillon

Access to mfsp->buf_queue must be protected at splbio(). Other minor
adjustments also made, such as passing mfsp to mfs_doio() directly.


42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


42567 12-Jan-1999 eivind

Silence warning about unused debug function. (I'll turn this function
into a DDB command in my next staticization sweep).


42400 08-Jan-1999 eivind

Add a warning about the copyright restraints.


42374 07-Jan-1999 bde

Don't pass unused timestamp args to UFS_UPDATE() or waste
time initializing them. This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().


42354 06-Jan-1999 bde

UFS_UPDATE() takes a boolean `waitfor' arg, so don't pass it the value
MNT_WAIT when we mean boolean `true' or check for that value not being
passed. There was no problem in practice because MNT_WAIT had the
magic value of 1.
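
The distinction above can be sketched as follows. This is an illustration
only: demo_update() and demo_update_fragile() are hypothetical stand-ins, and
DEMO_MNT_WAIT simply mirrors the value 1 that the log message mentions.

```c
#include <assert.h>

#define DEMO_MNT_WAIT 1	/* illustrative; the log says MNT_WAIT happened to be 1 */

/* Treats `waitfor' as a boolean: any nonzero value means "wait". */
static int
demo_update(int waitfor)
{
	return (waitfor ? 1 : 0);	/* 1: synchronous, 0: delayed */
}

/* Fragile variant: only works while callers pass exactly DEMO_MNT_WAIT. */
static int
demo_update_fragile(int waitfor)
{
	return (waitfor == DEMO_MNT_WAIT ? 1 : 0);
}
```

Both variants agree for the values 0 and 1, which is why nothing broke in
practice; they differ as soon as a caller passes some other nonzero value.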


42351 06-Jan-1999 bde

Ifdefed the conditionally used variable `prtrealloc'. Declare it
as volatile so that there is no chance that the code that it controls
is optimised away.


42350 06-Jan-1999 bde

Backed out rev.1.47. It just broke my optimisations for lazy syncing
of timestamps in rev.1.45. The soft updates bug was elsewhere.

Forgotten by: luoqi


42315 05-Jan-1999 eivind

Remove the 'waslocked' parameter to vfs_object_create().


42248 02-Jan-1999 bde

Ifdefed conditionally used simplock variables.


42244 02-Jan-1999 eivind

Remove the last clients of vfs_object_create(..., waslocked=1);
waslocked will go away shortly.

Reviewed by: dg


42216 01-Jan-1999 dillon

The mount_mfs process that stays in a supervisor context handling MFS
I/O requests must be marked P_SYSTEM because if it isn't and the system
decides to swap it or (god forbid) kill it, the system stands a good
chance of locking up.


42042 24-Dec-1998 bde

Fixed null pointer panics which I introduced in rev.1.86. Vnodes
may be revoked, so vnop routines must be careful about accessing
the vnode if they may have blocked.

Fixed marking for update after successfully reading or writing 0
bytes. In this case, POSIX.1 specifies marking if and only if the
requested count is nonzero, but rev.1.86 never marked.


41961 20-Dec-1998 bde

Remove unused file. It seems to have been a vestige of when mfs did its
own memory allocation.


41954 20-Dec-1998 dfr

In ufs_setattr(), if only one of va_atime or va_mtime is != VNOVAL, then
the code set the other field in the inode to VNOVAL. This can happen
sometimes on an NFS server.
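
A minimal sketch of the intended behaviour, assuming a "no value" marker of
-1 and a simplified pair of timestamps. The structure and function names are
hypothetical and this is not the ufs_setattr() code itself.

```c
#include <assert.h>

#define DEMO_VNOVAL (-1)	/* stand-in for the "no value supplied" marker */

struct demo_times {
	long atime;
	long mtime;
};

/* Copy only the timestamps the caller actually supplied. */
static void
demo_settimes(struct demo_times *ip, long va_atime, long va_mtime)
{
	if (va_atime != DEMO_VNOVAL)
		ip->atime = va_atime;
	if (va_mtime != DEMO_VNOVAL)
		ip->mtime = va_mtime;
}
```

Testing each field separately keeps the marker value from ever being stored
in the inode when only one of the two times is being set.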


41809 15-Dec-1998 julian

Add comments to code that I was trying to understand.
Hopefully will save others time.

Someone who understands this better might check for correctness.


41765 14-Dec-1998 dillon

Fix -Wuninitialized warning regarding zero-length var-args ctl element.
( this isn't really an error, but I think it is important to fix the
warning ).


41659 10-Dec-1998 julian

Remove some compiler warnings.


41610 09-Dec-1998 eivind

Make compare correct with unsigned types. (Problem introduced by Lite/2).


41591 07-Dec-1998 archie

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


41395 29-Nov-1998 bde

Don't use the strange null pointer constant `(ufs_daddr_t)0' in a call
to VOP_BMAP(). Don't use uncast NULLs in the same call.


41124 13-Nov-1998 dg

Restored the "reallocblks" code to its former glory. What this does is
basically do an on-the-fly defragmentation of the FFS filesystem, changing
file block allocations to make them contiguous. Thanks to Kirk McKusick
for providing hints on what needed to be done to get this working.


41059 10-Nov-1998 peter

add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE()


40791 31-Oct-1998 peter

Change dirty block list handling to use TAILQ macros.


40790 31-Oct-1998 peter

Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.
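
The idea of recording list membership in a flags word, rather than in the
list pointer, might look roughly like this. The structure, the flag names
and demo_make_dirty() are hypothetical; only the TAILQ macros from
<sys/queue.h> are real.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

#define DEMO_ONCLEAN 0x1	/* hypothetical: buffer is on the clean list */
#define DEMO_ONDIRTY 0x2	/* hypothetical: buffer is on the dirty list */

struct demo_buf {
	TAILQ_ENTRY(demo_buf) link;	/* linkage for whichever list holds it */
	int xflags;			/* which list, if any, holds the buffer */
};
TAILQ_HEAD(demo_buflist, demo_buf);

/* Move a buffer to the dirty list, consulting only the flags word. */
static void
demo_make_dirty(struct demo_buflist *clean, struct demo_buflist *dirty,
    struct demo_buf *bp)
{
	if (bp->xflags & DEMO_ONCLEAN) {
		TAILQ_REMOVE(clean, bp, link);
		bp->xflags &= ~DEMO_ONCLEAN;
	}
	if ((bp->xflags & DEMO_ONDIRTY) == 0) {
		TAILQ_INSERT_TAIL(dirty, bp, link);
		bp->xflags |= DEMO_ONDIRTY;
	}
}
```

Because membership is read from xflags, the list pointers never need to hold
a sentinel value, and calling the function twice is harmless.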


40692 28-Oct-1998 jkh

Clarify a rather ambiguous debugging message.


40672 27-Oct-1998 bde

Oops, the redundant tests for major numbers weren't redundant here.
They checked for the magic major number for the "device" behind mfs
mount points. Use a more obvious check for this device.

Debugged by: Andrew Gallatin <gallatin@cs.duke.edu>


40660 26-Oct-1998 bde

Removed redundant bitrotted checks for major numbers instead of updating
them.


40649 25-Oct-1998 bde

Don't follow null bdevsw pointers. The `major(dev) < nblkdev' test rotted
when bdevsw[] became sparse. We still depend on magic to avoid having to
check that (v_rdev) device numbers in vnodes are not NODEV.

Removed redundant `major(dev) < nblkdev' tests instead of updating them.


40648 25-Oct-1998 phk

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


40469 17-Oct-1998 bde

Use only the correct raw partition for writing labels. Don't use the
partition that the label ioctl is being done on just because it has
offset 0, since there is no guarantee that such a partition is large
enough to contain the label. Don't use the wrong raw partition (0
instead of RAW_PART).

This fixes problems rewriting bizarre labels (with a nonzero offset
for the 'a' partition) in newfs(8). Such labels shouldn't normally
be used, but creating them was allowed if the ioctl was done on the
raw partition, and sysinstall creates them if the root partition isn't
allocated first.

Note that allowing write access to a partition other than the one that
has been checked for write access doesn't increase security holes
significantly, since write access to any partition already allows
changing the in-core label.

This fix should be in 3.0R. Rev.1.26 of newfs/newfs.c shouldn't be
in 3.0R.


40448 16-Oct-1998 jkh

fixup for alpha.


40304 13-Oct-1998 bde

Fixed bloatage of `struct inode'. We used 5 "spare" fields for ext2fs,
but when i_effnlink was added to support soft updates, there was only
room for 4 spares. The number of spares was not reduced, so the inode
size became 260 (on i386's), or 512 after rounding up by malloc().
Use one spare field in `struct dinode' instead of the 5th spare field
in the inode and reduced to 4 spares in the inode so that the size is
256 again.

Changed the types of the spares in the inode from int to u_int32_t
so that the inode size has more chance of being <= 256 under other
arches, and downdated ext2fs to match (it was broken to use ints
before rev.1.1).
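
The size arithmetic above can be checked with a small sketch, assuming (as
the message implies) an allocator that rounds each request up to the next
power of two. demo_bucket_size() is a hypothetical helper, not the kernel
malloc() code.

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to the next power of two (assumed allocator policy). */
static size_t
demo_bucket_size(size_t size)
{
	size_t bucket = 1;

	while (bucket < size)
		bucket <<= 1;
	return (bucket);
}
```

Under that assumption a 260-byte inode occupies a 512-byte bucket, while a
256-byte inode fits its bucket exactly, so four extra bytes nearly double
the memory used per inode.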


40251 12-Oct-1998 peter

"fix" a warning


40164 10-Oct-1998 jkh

Allow more flexible use of MFS root.
Submitted by: peter


40153 09-Oct-1998 peter

MODINFO_ADDR has real addresses now, remove the manual relocation based
on cpu type.


40099 09-Oct-1998 jkh

Add some evil temporary phys-to-kern translation for mfs.


40094 09-Oct-1998 jkh

include proper header for Mike's new stuff.


40084 08-Oct-1998 jkh

Allow the module area to be used in order to find the MFS image
(in addition to allowing it to be compiled in) and stop overloading
the MFS_ROOT variable to store size information.


40038 07-Oct-1998 luoqi

Use vm_page_xxx() inline functions to manipulate vm_page::flags, vm_page::busy.
As a side effect, a few wakeup() calls are added, which might fix some of the
missing vm_page wakeups people have been seeing.

Reviewed by: Doug Rabson <dfr@nlsystems.com>


39933 03-Oct-1998 nate

Fix 'noatime' bug that was unrelated to use of noatime.

The problem is caused when a directory block is compacted. When this
occurs, softdep_change_directoryentry_offset() is called to relocate each
directory entry and adjust its matching diradd structure, if any, to match
the new location of the entry. The bug is that while
softdep_change_directoryentry_offset() correctly adjusts the offsets of
the diradd structures on the pd_diraddhd[] lists (which are not yet ready
to be committed to disk), it fails to adjust the offsets of the diradd
structures on the pd_pendinghd list (which are ready to be committed to
disk). This causes the dependency structures to be inconsistent with
the buf contents. Now, if the compaction has moved a directory entry to
the same offset as one of the diradd structures on the pd_pendinghd list
*and* a syscall is done that tries to remove this directory entry before
this directory block has been written to disk (which would empty
pd_pendinghd), a sanity check in newdirrem() will call panic() when it
notices that the inode number in the entry that is to be removed doesn't
match the inode number in the diradd structure with that offset of that
entry.

Reviewed by: Kirk McKusick <mckusick@McKusick.COM>
Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>


39796 30-Sep-1998 mckusick

Do not allow a mounted on directory to be rmdir'ed. This removal can
happen when an NFS exported filesystem tries to remove a locally
mounted on directory.
PR: kern/7272
Submitted by: Andre Albsmeier <andre.albsmeier@mchp.siemens.de>


39669 26-Sep-1998 bde

Fixed clean flag handling:
- don't set the clean flag on unmount of an unclean filesystem that was
(forcibly) mounted rw.
- set the clean flag on rw -> ro update of a mounted initially-clean
filesystem.
- fixed some style bugs (mostly long lines).

This uses the fs_flags field and FS_UNCLEAN state bit which were
introduced in the softdep changes. NetBSD uses extra state bits in
fs_clean.

Reviewed by: luoqi


39623 24-Sep-1998 luoqi

Eliminate a race in VOP_FSYNC() when softupdates is enabled.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Two minor changes are also included,
1. Remove gratuitous checks for error return from vn_lock with LK_RETRY set,
vn_lock should always succeed in these cases.
2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount
a little more unstable. It also keeps us in sync with other BSDs.
Suggested by: Bruce Evans <bde@zeta.org.au>


39281 15-Sep-1998 luoqi

Restore pre-v1.44 behavior: always copy modified in-core inode to disk
buffer. Otherwise some in-core inode changes might be lost, including
important meta data (e.g. size) if softupdates is enabled.


39238 15-Sep-1998 gibbs

When a buffer is removed from a buffer queue, remember its block number
and use it as "the currently active" buffer in doing disk sort calculations.


39187 14-Sep-1998 sos

Remove the SLICE code.
This clearly needs a lot more thought, and we don't need this to hunt
us down in 3.0-RELEASE.


39099 12-Sep-1998 bde

Don't dereference an uninitialized pointer in dead code. The dead
code gets executed if it is compiled without optimization.


38909 07-Sep-1998 bde

Removed statically configured mount type numbers (MOUNT_*) and all
references to them.

The change a couple of days ago to ignore these numbers in statically
configured vfsconf structs was slightly premature because the cd9660,
cfs, devfs, ext2fs, nfs vfs's still used MOUNT_* instead of the number
in their vfsconf struct.


38907 07-Sep-1998 bde

Put the zombie ffs sysctl node in "notyet" state together with its few
remaining children. Prepare it for MOUNT_UFS going away.


38902 07-Sep-1998 phk

Make MFS do the default on VOP_FREEBLKS().

XXX: we could deallocate the storage, but somebody else will
have to pick up that task.


38862 05-Sep-1998 phk

Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform
device drivers about sectors no longer in use.

Device-drivers receive the call through d_strategy, if they have
D_CANFREE in d_flags.

This allows flash based devices to erase the sectors and avoid
pointlessly carrying them around in compactions.

Reviewed by: Kirk McKusick, bde
Sponsored by: M-Systems (www.m-sys.com)


38418 18-Aug-1998 bde

Quick fix for breakage of read clustering on non-IDE drives. Read
clustering is obsolescent technology so hardly anyone noticed. On
a DORS 32160 SCSI drive with 4 tags, read clustering makes very
little difference even for huge sequential reads. However, on a
ZIP SCSI drive with 0 tags, the minimum overhead per block is about
40 msec, so very large clusters must be used to get anywhere near
the maximum transfer rate. Using clusters consisting of 1 8K block
reduces the transfer rate to about 250K/sec. Under msdosfs, missing
read clustering is normal and a cluster size of 1 512 byte block
reduces the transfer rate to about 25K/sec.

Broken in: rev.1.18


38408 17-Aug-1998 bde

Removed unused includes.


38292 12-Aug-1998 msmith

"The releasing of the reference and lock is not temporary and belongs
where it is. The reference and lock(s) are acquired just above the
code in VREF() and relookup()."

Submitted by: Michael Hancock <michaelh@cet.co.jp>


38291 12-Aug-1998 julian

Handle the case of moving a directory onto the top of a sibling's
child of the same name.

Submitted by: Kirk McKusick with fixes from Luoqi Chen
Obtained from: Whistle test tree.


37922 28-Jul-1998 bde

Used daddr_t's, not ints, to store disk block numbers. Updated printf
formats and args to match. Fixed old printf format errors (all related;
most were hidden by calling printf indirectly).

This change somehow avoids compiler bugs for 64-bit longs on i386's,
although it increases the number of 64-bit calculations.


37887 27-Jul-1998 bde

Made lazy syncing of timestamps for special files non-optional.


37649 15-Jul-1998 bde

Cast pointers to uintptr_t/intptr_t instead of to u_long/long,
respectively. Most of the longs should probably have been
u_longs, but this change is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
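
A minimal sketch of the cast in question, using a hypothetical pointer hash.
demo_ptr_hash() and its shift amount are illustrative only; uintptr_t is the
standard integer type wide enough to hold a pointer value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hash: fold a pointer into a table index. */
static unsigned
demo_ptr_hash(const void *p, unsigned mask)
{
	/*
	 * Casting through uintptr_t keeps the integer the same width as
	 * the pointer, so no size-mismatch warning is produced.
	 */
	return ((unsigned)(((uintptr_t)p >> 4) & mask));
}
```

Only the final, masked result is narrowed to unsigned, at which point the
truncation is deliberate and harmless.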


37555 11-Jul-1998 bde

Fixed printf format errors.


37539 10-Jul-1998 julian

Add code missed in the initial Soft updates integration.
Make the unallocated parts of a directory have a known state
in case we need it later.


37520 08-Jul-1998 julian

Don't update superblock if mounted readonly,
also fixes some problems with softupdates on root.
More cleanups are needed here..
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>


37490 08-Jul-1998 julian

Catch a few corner cases where FreeBSD differs enough from BSD 4.4 to
confuse Soft updates..
Should solve several "dangling deps" panics.


37384 04-Jul-1998 julian

VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>


37364 03-Jul-1998 bde

Restored revs.1.89-1.90 which I somehow clobbered in rev.1.91.


37363 03-Jul-1998 bde

Sync timestamp changes for inodes of special files to disk as late
as possible (when the inode is reclaimed). Temporarily only do
this if option UFS_LAZYMOD configured and softupdates aren't enabled.
UFS_LAZYMOD is intentionally left out of /sys/conf/options.

This is mainly to avoid almost useless disk i/o on battery powered
machines. It's silly to write to disk (on the next sync or when the
inode becomes inactive) just because someone hit a key or something
wrote to the screen or /dev/null.

PR: 5577
Previous version reviewed by: phk


37362 03-Jul-1998 bde

Centralized in-core inode update. Update the in-core inode directly
in ufs_setattr() so that there is no need to pass timestamps to
UFS_UPDATE() (everything else just needs the current time). Ignore
the passed-in timestamps in UFS_UPDATE() and always call ufs_itimes()
(was: itimes()) to do the update. The timestamps are still passed
so that all the callers don't need to be changed yet.


37182 27-Jun-1998 phk

Make vprint() print dev_t in hex also.


37181 27-Jun-1998 phk

Report the type from the inode, not the vnode.


37167 26-Jun-1998 jkh

Flesh this document out just a little in response to some user
questions and also recommend linking over copying since, at this stage,
a stale copy is a real concern.


37094 21-Jun-1998 bde

Removed unused includes.


36990 14-Jun-1998 julian

Slight change to directory cleanup
Makes soft updates a bit cleaner. Eliminates some warnings about
'corrupted directories' from fsck.


36936 12-Jun-1998 julian

Note which version of Kirk's sources this corresponds to.


36935 12-Jun-1998 julian

Fix the case when renaming to a file that you've just created and deleted,
that had an inode that has not yet been written to disk, when the inode of the
new file is also not yet written to disk, and your old directory entry is not
yet on disk but you need to remove it and the new name exists in memory
but has been deleted but the transaction to write the deleted name to disk
exists and has not yet been cancelled by the request to delete the
nonexistent name. I don't know how kirk could have missed such a glaring
problem for so long. :-) Especially since the inconsistency survived on
the disk for a whole 4 seconds on average before being fixed by other code.
This was not a crashing bug but just led to filesystem inconsistencies
if you crashed.

Submitted by: Kirk McKusick (mckusick@mckusick.com)


36900 11-Jun-1998 julian

Add B_NOCACHE to several cases where BSD4.4 only required a B_INVAL.
Change worked out by john and kirk in consort.


36871 10-Jun-1998 julian

Fix for "live inode" panic.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Reviewed by: yeah right...


36866 10-Jun-1998 julian

Remove buggy debugging code.


36863 10-Jun-1998 julian

Back out John's changes 1.45 -> 1.46
Kirk confirms that the original semantic was what he wanted...
(well, a very slight difference)
May fix "dangling deps" panic with soft updates.


36779 08-Jun-1998 julian

The version of the softdep changes in FreeBSD broke the
(doingdirectory && !newparent) case of ufs_rename().
rename("D1/X/", "D2/Y/") gives a wrong link count for D2.

Submitted by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Kirk McKusick <mckusick@McKusick.COM>


36723 07-Jun-1998 bde

Null change. Forgot to mention in previous log message that MNT_NOATIME
is now ignored for special files, so that mounting root with option
noatime doesn't break reporting of idle times in programs like `w'.
The problem of excessive disk updates just to stamp atimes will be
handled for special files by only writing atimes to disk when inodes
become active. This works well because special files are relatively
uncommon and their atimes are even more disposable at panic time than
regular files' atimes.


36721 07-Jun-1998 bde

Fixed some longstanding timestamp bugs:
1. mark atimes and mtimes of special files and fifos for update upon
successful completion of non-null i/o, not at the beginning of the
syscall.
2. never update file times for readonly filesystems. They were updated
for stats and closes but not for syncs. The updates were of course
only in-core and were thrown away when the inode was uncached, so
the times sometimes appeared to go backwards.

Improved comments in code related to (1) (mostly by removing them).

Unmacroized ITIMES(). The test in (2) bloated it even more. Don't
call getmicrotime() in the function version of it when we only need
the time in seconds.


36646 04-Jun-1998 dfr

Use size_t instead of u_int for sizes.


36645 04-Jun-1998 dfr

If the filesystem blocksize is less than the VM page size, use the generic
getpages code. This happens for filesystems with 4k pages on the alpha since
the normal alpha pagesize is 8k.


36644 04-Jun-1998 dfr

Don't cast a pointer to an int in DQHASH.


36581 02-Jun-1998 julian

Add a reference to the original softupdates paper


36580 02-Jun-1998 julian

Add a reference to the Ganger/Patt paper


36404 27-May-1998 julian

A fix to a debug test from Kirk.


36235 19-May-1998 julian

Ensure that there is enough information here, so that people can use
soft updates should they desire.


36234 19-May-1998 julian

Bring up-to-date with Whistle's current version
Includes some debugging code.


36232 19-May-1998 julian

Merge with Kirk's version as of Feb 20

His version 9.23 == our version 1.5 of ffs_softdep.c
His version 9.5 == our version 1.4 of softdep.c


36225 19-May-1998 julian

Merge in Kirk's changes to stop softupdates from hogging all of memory.


36212 19-May-1998 julian

Change to stop a silly panic. This should be understood better.
Change a buffer swizzle trick to a bcopy. It would be nice if the efficient
trick could be used in the future.


36210 19-May-1998 julian

First published FreeBSD version of soft updates Feb 5.


36207 19-May-1998 julian

This commit was generated by cvs2svn to compensate for changes in r36206,
which included commits to RCS files with non-trunk default branches.


36202 19-May-1998 julian

This commit was generated by cvs2svn to compensate for changes in r36201,
which included commits to RCS files with non-trunk default branches.


36147 18-May-1998 julian

Try to stop the user from using mount -u to set the async flag on
a filesystem currently using soft updates.
Also needs a new copy of ffs_softdep.c to complete the fix.


36119 17-May-1998 phk

s/nanoruntime/nanouptime/g
s/microruntime/microuptime/g

Reviewed by: bde


35955 11-May-1998 julian

Add missing splx()

Submitted by: Luoqi Chen <luoqi@chen.ml.org>


35954 11-May-1998 julian

Submitted by: abial@nask.pl
Minor fix to support SLICE in MFS...


35823 07-May-1998 msmith

In the words of the submitter:

---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.

Quality testing was done with testvn, and lat_fs from the lmbench suite.

Some NFS client testing courtesy of Patrik Kudo.

vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------

Submitted by: Michael Hancock <michaelh@cet.co.jp>


35769 06-May-1998 msmith

As described by the submitter:

Reverse the VFS_VRELE patch. Reference counting of vnodes does not need
to be done per-fs. I noticed this while fixing vfs layering violations.
Doing reference counting in generic code is also the preference cited by
John Heidemann in recent discussions with him.

The implementation of alternative vnode management per-fs is still a valid
requirement for some filesystems but will be revisited sometime later,
most likely using a different framework.

Submitted by: Michael Hancock <michaelh@cet.co.jp>


35696 04-May-1998 dyson

Correct an error that I made where the vtruncbuf was changed back
to vinvalbuf, but I incorrectly added the "V_SAVE|V_SAVEMETA" flags.
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>


35526 30-Apr-1998 dyson

Fix an error that I made with an optimization. In the case
of softupdates, we need to do vtruncbuf the old way. Luoqi
caught, found the bug and submitted this fix.
Submitted by: Luoqi Chen <luoqi@chen.ml.org>


35323 20-Apr-1998 julian

Make the devfs SLICE option a standard type option.
(hopefully it will go away eventually anyhow)


35319 19-Apr-1998 julian

Add changes and code to implement a functional DEVFS.
This code will be turned on with the TWO options
DEVFS and SLICE. (see LINT)
Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will delineate these changes.

/dev will be automatically mounted by init (thanks phk)
on bootup. See /sys/dev/slice/slice.4 for more info.
All code should act the same without these options enabled.

Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others

This code does not support the following:
bad144 handling.
Persistence. (My head is still hurting from the last time we discussed this)
ATAPI floppies are not handled by the SLICE code yet.

When this code is running, all major numbers are arbitrary and COULD
be dynamically assigned. (this is not done, for POLA only)
Minor numbers for disk slices ARE arbitrary and dynamically assigned.


35256 17-Apr-1998 des

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


35205 15-Apr-1998 bde

Fixed bitrot in the non-softdep case of ufs_dirremove():
- restored async mount support. The first entry in a block is still
always written synchronously, although it probably shouldn't be in
the async case.
- restored use of BWRITE() instead of bowrite() for the DOWHITEOUT
case, although bowrite() is probably better.

Broken by: merge of softdep changes (rev.1.22).
Found by: lmbench2 delete-file benchmarks.


35084 06-Apr-1998 peter

Back this out, allowing users to get a fd connected to a symlink is
just too dangerous.


35083 06-Apr-1998 peter

Don't panic if a VOP_READ() gets through on a short link, Just Do It
(because we can :-). This means you can open a link file (or pseudo-file
in the case of short links where the data is stored in the inode rather
than disk blocks) and read the contents.
However, trap any writes from the user as it's difficult to do the right
thing in all cases. A link may be short and the user may be trying to
extend it beyond the limit and so on. Although.. being able to re-target
a symlink without deleting it first might have been nice.
This stuff is a bit perverse since symlink() and readlink() calls can
end up actually being implemented as read/write vnode ops.

Reviewed by: phk


35029 04-Apr-1998 phk

Time changes mark 2:

* Figure out UTC relative to boottime. Four new functions provide
time relative to boottime.

* move "runtime" into struct proc. This helps fix the calcru()
problem in SMP.

* kill mono_time.

* add timespec{add|sub|cmp} macros to time.h. (XXX: These may change!)

* nanosleep, select & poll takes long sleeps one day at a time

Reviewed by: bde
Tested by: ache and others
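
As a rough illustration of what add, subtract and compare helpers for
struct timespec have to do, here is a sketch written as functions. The
demo_ names and signatures are hypothetical; the message itself notes that
the real macros may change, so this is not their definition.

```c
#include <assert.h>
#include <time.h>

#define DEMO_NSEC_PER_SEC 1000000000L

/* vvp += uvp, carrying nanoseconds into seconds. */
static void
demo_timespecadd(struct timespec *vvp, const struct timespec *uvp)
{
	vvp->tv_sec += uvp->tv_sec;
	vvp->tv_nsec += uvp->tv_nsec;
	if (vvp->tv_nsec >= DEMO_NSEC_PER_SEC) {
		vvp->tv_sec++;
		vvp->tv_nsec -= DEMO_NSEC_PER_SEC;
	}
}

/* vvp -= uvp, borrowing a second when nanoseconds go negative. */
static void
demo_timespecsub(struct timespec *vvp, const struct timespec *uvp)
{
	vvp->tv_sec -= uvp->tv_sec;
	vvp->tv_nsec -= uvp->tv_nsec;
	if (vvp->tv_nsec < 0) {
		vvp->tv_sec--;
		vvp->tv_nsec += DEMO_NSEC_PER_SEC;
	}
}

/* Returns <0, 0 or >0 as a is earlier than, equal to or later than b. */
static int
demo_timespeccmp(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return (a->tv_sec < b->tv_sec ? -1 : 1);
	if (a->tv_nsec != b->tv_nsec)
		return (a->tv_nsec < b->tv_nsec ? -1 : 1);
	return (0);
}
```

The sketch assumes both operands are already normalized, that is, tv_nsec
lies in the range 0 to 999999999, so a single carry or borrow is enough.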


34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't an atomic variable, so splfoo() protection was needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now use the new variable time_second instead.

gettime() changed to getmicrotime().

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeared time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passed &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


34924 28-Mar-1998 bde

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


34913 27-Mar-1998 peter

Enable the use of soft updates on the root filesystem. Previously, the
softdep mode could only be activated on the initial mount of a filesystem
and then only if it was a read-write mount. A 'mount -r' (as done in the
rootfs mount) followed by a 'mount -u' to convert to read-write didn't
start softdep mode.


34901 26-Mar-1998 phk

Add two new functions, get{micro|nano}time.

They are atomic, but return in essence what is in the "time" variable.
gettime() is now a macro front for getmicrotime().

Various patches to use the two new functions instead of the various
hacks used in their absence.

Some punctuation and grammar patches from Bruce.

A couple of XXX comments.


34826 23-Mar-1998 bde

Forward declare even more structs to restore some self-sufficiency.
Didn't fix new dependence on <ufs/ufs/inode.h> and its prerequisites.


34734 21-Mar-1998 dyson

Softdep_sync_metadata appears to expect that it is called at splbio,
so make it so...


34696 19-Mar-1998 dyson

Fix vfs_bio_awrite usage, and correct vtruncbuf usage.


34611 16-Mar-1998 dyson

Some VM improvements, including elimination of a lot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliability
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, make page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


34441 09-Mar-1998 dyson

Correct a problem with the ffs_getpages routine that manifests itself
during the tail command. The amount to read is incorrectly calculated.
Submitted by: Tor Egge


34266 08-Mar-1998 julian

Reviewed by: dyson@freebsd.org (John Dyson), dg@root.com (David Greenman)
Submitted by: Kirk McKusick (mckusick@mckusick.com)
Obtained from: Whistle development tree


34248 08-Mar-1998 julian

Submitted by: Kirk McKusick

Stub file for soft updates.


34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


34184 07-Mar-1998 bde

Fixed missing simple_lock() in ffs_mountfs().


33964 01-Mar-1998 msmith

The intent is to get rid of WILLRELE in vnode_if.src by making
a complement to all ops that return a vpp, VFS_VRELE. This is
initially only for file systems that implement the following ops
that do a WILLRELE:

vop_create, vop_whiteout, vop_mknod, vop_remove, vop_link,
vop_rename, vop_mkdir, vop_rmdir, vop_symlink

This is initial DNA that doesn't do anything yet. VFS_VRELE is
implemented but not called.

A default vfs_vrele was created for fs implementations that use the
standard vnode management routines.

VFS_VRELE implementations were made for the following file systems:

Standard (vfs_vrele)
ffs mfs nfs msdosfs devfs ext2fs

Custom
union umapfs

Just EOPNOTSUPP
fdesc procfs kernfs portal cd9660

These implementations may change as VOP changes are implemented.

In the next phase, in the vop implementations calls to vrele and the vrele
part of vput will be moved to the top layer vfs_vnops and made visible
to all layers. vput will be replaced by unlock in these cases. Unlocking
will still be done in the per fs layer but the refcount decrement will be
triggered at the top because it doesn't hurt to hold a vnode reference a
little longer. This will have minimal impact on the structure of the
existing code.

This will only be done for vnode arguments that are released by the various
fs vop implementations.

Wider use of VFS_VRELE will likely require restructuring of the code.

Reviewed by: phk, dyson, terry et al.
Submitted by: Michael Hancock <michaelh@cet.co.jp>


33847 26-Feb-1998 msmith

In the author's words:

These diffs implement the first stage of a VOP_{GET|PUT}PAGES pushdown
for local media FS's.

See ffs_putpages in /sys/ufs/ufs/ufs_readwrite.c for implementation
details for generic *_{get|put}pages for local media FS's. Support
is trivial to add for any FS that formerly relied on the default
behaviour of the vnode_pager in EOPNOTSUPP cases (just copy the
ffs_getpages() code for the FS in question's *_{get|put}pages).

Obviously, it would be better if each local media FS implemented a
more optimal method, instead of calling an exported interface from
the /sys/vm/vnode_pager.c, but this is a necessary first step in
getting the FS's to a point where they can be supplied with better
implementations on a case-by-case basis.

Obviously, the cd9660_putpages() can be rather trivial (since it
is a read-only FS type 8-)).

A slight (temporary) modification is made to print a diagnostic message
in the case where the underlying filesystem attempts to engage in the
previous behaviour. Failure is likely to be ungraceful.

Submitted by: terry@freebsd.org (Terry Lambert)


33820 25-Feb-1998 bde

Fixed missing permissions checking for mounting by non-root.

There is now less need for the vfs.usermount sysctl. msdosfs already
has this change, modulo a missing LK_RETRY, via NetBSD. At least
ext2fs is missing this and many other changes from Lite2.

Obtained from: Lite2


33678 20-Feb-1998 bde

Don't depend on "implicit int".


33443 16-Feb-1998 msmith

Fix a panic resulting from executing off an MFS image. This corrects the
recently observed problem with the install image.
Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no>


33289 13-Feb-1998 bde

Removed unnecessary dependencies on KERNEL and DIAGNOSTIC. This was
more useful when opt_diagnostic.h had to be included.


33181 09-Feb-1998 eivind

Staticize.


33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


33109 05-Feb-1998 dyson

1) Start using a cleaner and more consistent page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagein.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


33054 03-Feb-1998 bde

Forward declare some structs so that this file is more self-sufficient.


32976 01-Feb-1998 dyson

Back out recent laptop sync changes. They had significant errors.


32951 01-Feb-1998 dyson

Support more intelligent sync operations for MNT_NOATIME.
PR: kern/5577
Submitted by: Craig Leres <leres@ee.lbl.gov>


32944 31-Jan-1998 julian

Serves me right for not putting SUIDDIR in LINT. It got bitrot.
This should stop complaints about it not working for people.


32889 30-Jan-1998 phk

Retire LFS.

If you want to play with it, you can find the final version of the
code in the repository under the tag LFS_RETIREMENT.

If somebody makes LFS work again, adding it back is certainly
desirable, but as it is now nobody seems to care much about it,
and it has suffered considerable bitrot since its somewhat haphazard
integration.

R.I.P


32726 24-Jan-1998 eivind

Make all file-system (MFS, FFS, NFS, LFS, DEVFS) related option new-style.

This introduces an xxxFS_BOOT for each of the rootable filesystems.
(Presently not required, but encouraged to allow a smooth move of option *FS
to opt_dontuse.h later.)

LFS is temporarily disabled, and will be re-enabled tomorrow.


32724 24-Jan-1998 dyson

Add better support for larger I/O clusters, including larger physical
I/O. The support is not mature yet, and some of the underlying implementation
needs help. However, support does exist for IDE devices now.


32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


32585 17-Jan-1998 dyson

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtle bugs aren't fixed yet.


32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

Vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon as reasonably possible.


32161 01-Jan-1998 bde

Removed unused #includes again. They thrashed when mfs_reclaim thrashed
to ufs_reclaim and back.


32072 29-Dec-1997 dyson

Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem
with the object cache removal.


32071 29-Dec-1997 dyson

Lots of improvements, including restructuring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


32011 27-Dec-1997 bde

Unspammed nested include of <vm/vm_zone.h>.


31920 21-Dec-1997 dyson

I added vfs_ioopt prematurely, disabled.


31853 19-Dec-1997 dyson

Some performance improvements, and code cleanups (including changing our
expensive OFF_TO_IDX to btoc whenever possible.)


31788 16-Dec-1997 eivind

Make LINT compile again after wollman introduced poll() here.

Overlooked by: wollman


31749 15-Dec-1997 eivind

Convert SUIDDIR fully to a new-style option.

Forgotten by: julian


31727 15-Dec-1997 wollman

Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL
if one of the new poll types is requested; hopefully this will not break
any existing code. (This is done so that programs have a dependable
way of determining whether a filesystem supports the extended poll types
or not.)

The new poll types added are:

POLLWRITE - file contents may have been modified
POLLNLINK - file was linked, unlinked, or renamed
POLLATTRIB - file's attributes may have been changed
POLLEXTEND - file was extended

Note that the internal operation of poll() means that it is impossible
for two processes to reliably poll for the same event (this could
be fixed but may not be worth it), so it is not possible to rewrite
`tail -f' to use poll at this time.


31699 13-Dec-1997 bde

Restored ufs_pathconf() from rev.1.61. vop_stdpathconf() is too
general to be of much use. Using it here broke the _PC_NAME_MAX,
_PC_NO_TRUNC and _PC_PATH_MAX cases, and weakened the _PC_MAX_CANON,
_PC_MAX_INPUT and _PC_VDISABLE cases.


31683 12-Dec-1997 peter

Fix(?) some style consistency breakage and do some other nit-picking on
the SUIDDIR changes.


31561 05-Dec-1997 bde

Don't include <sys/lock.h> in headers when only `struct simplelock' is
required. Fixed everything that depended on the pollution.


31557 05-Dec-1997 jkh

Needs to include <sys/lock.h> if we're using struct lock.


31493 02-Dec-1997 phk

In all such uses of struct buf: 's/b_un.b_addr/b_data/g'


31486 02-Dec-1997 bde

`nextgennumber' can go away now that it is no longer (ab)used by foreign
fs's.


31485 02-Dec-1997 bde

Use the same algorithm as ffs for generation numbers.


31484 02-Dec-1997 bde

Fix a small style bug in the generation number change (rev.1.33) before
copying the change to other fs's.


31394 24-Nov-1997 bde

Fixed overflow in ufs_getblns(). For ufs on systems with 32-bit ints,
triple indirect blocks only worked for block sizes of 4K, since
MNINDIR(ump)**3 overflows for larger block sizes (e.g.,
(8192/4)**3 = 2**33 > INT_MAX). This fix is not the obvious one of
changing some types to 64 bits. It rearranges the code to avoid some
unnecessary 64-bit calculations.
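
The overflow described above can be checked with a small sketch. This is
not the code from ufs_getblns(), and the log does not show how the real
fix rearranged things; the helper names (nindir_for, cube_fits_wide,
cube_fits_narrow) are hypothetical, and 4-byte block pointers are assumed
from the (8192/4) example. The second check shows one generic way to ask
the same question without any 64-bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: block pointers per indirect block, assuming
 * 4-byte block numbers as in the (8192/4) example in the log entry. */
static int32_t
nindir_for(int32_t bsize)
{
	assert(bsize >= 4);
	return (bsize / 4);
}

/* Reference check done in 64 bits: does nindir**3 fit in a 32-bit int? */
static int
cube_fits_wide(int32_t nindir)
{
	int64_t n = nindir;

	assert(n > 0 && n <= (1 << 20));	/* keeps n*n*n below 2**63 */
	return (n * n * n <= INT32_MAX);
}

/* Same question using only 32-bit arithmetic: divide instead of multiply. */
static int
cube_fits_narrow(int32_t nindir)
{
	assert(nindir > 0);
	return (nindir <= INT32_MAX / nindir / nindir);
}
```

With 4K blocks nindir is 1024 and 1024**3 = 2**30 fits; with 8K blocks
nindir is 2048 and 2048**3 = 2**33 does not, which matches the failure
described in the entry.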

Reviewed by: Kirk McKusick <mckusick@McKusick.COM>


31352 22-Nov-1997 bde

Staticized.


31351 22-Nov-1997 bde

Unremoved prtrealloc and the declaration of ffs_clusteralloc(). These
are used in the `#ifdef notyet' case :-). This case is used except in
the `#if !defined (not_yes)' case :-|. This has something to do with
the `#ifdef notyet_block_reallocation_enabled' case in vfs_cluster.c :-(.


31312 20-Nov-1997 bde

Fixed marking of access time for special files and fifos (don't do
it if the file system is mounted noatime). Not fixed: the access
time is marked at the start of a read() and not marked on successful
completion. I think this should be handled at the vfs level.

Print a better panic message for missing vops. Don't use printf()
before panic(), since the printf()ed part isn't shown by gdb.
This actually loses a little with the current gdb, since gdb just
prints the fmt arg to panic, so %'s aren't expanded. gdb should
fetch the full message from the message buffer if possible.

Fixed default vop function for vop_getpages_desc. It needs to
just return EOPNOTSUPP so that the vnode pager can get the pages
in using a general method. Panicking broke exec'ing of files on
ext2fs file systems. ffs works because it doesn't use the default.

Fixed nearby style bugs.


31274 18-Nov-1997 bde

Removed an unused #include in the `#ifdef KERNEL' case.

Fixed a comment to match the code. The code is still wrong
(ffs_checkoverlap() should be staticized and called from a
ddb command).


31269 18-Nov-1997 phk

unifdef -UEXT2FS


31147 13-Nov-1997 julian

oops, fix left out semicolon in code I patched by hand.


31144 13-Nov-1997 julian

Reviewed by: hackers@freebsd.org in general
Obtained from: Whistle Communications tree

Add an option to the way UFS works dependent on the SUID bit of directories.
This change makes things a whole lot simpler on systems running as
fileservers for PCs and Macs. To enable the new code you must
1/ enable option SUIDDIR on the kernel.
2/ mount the filesystem with option suiddir.
Hopefully this makes it difficult enough for people to
do this accidentally.
See the new chmod(2) man page for detailed info.


31132 12-Nov-1997 julian

Reviewed by: various.

Ever since I first saw the way the mount flags were used I've hated the
fact that modes, and events, internal and exported, and short-term
and long term flags are all thrown together. Finally it's annoyed me enough..
This patch to the entire FreeBSD tree adds a second mount flag word
to the mount struct. It is not exported to userspace. I have moved
some of the non exported flags over to this word. This means that we now
have 8 free bits in the mount flags. There are another two that might
well move over, but which I'm not sure about.
The only user visible change would have been in pstat -v, except
that davidg has disabled it anyhow.
I'd still like to move the state flags and the 'command' flags
apart from each other.. e.g. MNT_FORCE really doesn't have the
same semantics as MNT_RDONLY, but that's left for another day.


31016 07-Nov-1997 phk

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


30994 06-Nov-1997 phk

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warnings, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
to be recompiled.


30890 01-Nov-1997 tegge

Move declaration of M_MFSNODE from mfs_vnops.c to mfsnode.h.


30889 01-Nov-1997 tegge

Bring back mfs_reclaim(), which is used to reclaim the master vnode in MFS.


30780 27-Oct-1997 bde

Removed unused #includes. The need for most of them went away with
recent changes (docluster* and vfs improvements).


30779 27-Oct-1997 bde

Forward declare precisely the structs that are actually used in this header.


30743 26-Oct-1997 phk

VFS interior redecoration.

Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.


30608 20-Oct-1997 phk

I believe this fixes MFS after I broke it.


30563 19-Oct-1997 dyson

This might fix the mfs_badop problem left over with the cool VFS fixes.
PHK should check this.


30513 17-Oct-1997 phk

Make a set of VOP standard lock, unlock & islocked VOP operators, which
depend on the lock being located at vp->v_data. Saves 3x3 identical
vop procs, more as the other filesystems become lock aware.


30496 16-Oct-1997 phk

VFS clean up "hekto commit"

1. Add defaults for more VOPs
VOP_LOCK vop_nolock
VOP_ISLOCKED vop_noislocked
VOP_UNLOCK vop_nounlock
and remove direct reference in filesystems.

2. Rename the nfsv2 vnop tables to improve sorting order.


30492 16-Oct-1997 phk

Another VFS cleanup "kilo commit"

1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
interface function, and now lives in the ufsmount structure.

2. Remove VOP_SEEK, it was unused.

3. Add more default vops:

VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp

And remove identical functionality from filesystems

4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)

5. Try to make system wide VOP functions have vop_* names.

6. Initialize the um_* vectors in LFS.

(Recompile your LKMS!!!)


30476 16-Oct-1997 phk

Staticize the ufs vnops member functions.


30475 16-Oct-1997 phk

Remove an overlapping variable I created in last round.

Don't do pointer subtraction on void *

Use VOP_STRATEGY instead of homegrown stuff.

Add an XXX warning for LFS freaks to ponder.


30474 16-Oct-1997 phk

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There is still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


30469 16-Oct-1997 julian

Two more places where root filesystems were mounted, put them at the head of
the mount list in case there is already DEVFS present.


30439 15-Oct-1997 phk

vnops megacommit

1. Use the default function to access all the specfs operations.
2. Use the default function to access all the fifofs operations.
3. Use the default function to access all the ufs operations.
4. Fix VCALL usage in vfs_cache.c
5. Use VOCALL to access specfs functions in devfs_vnops.c
6. Staticize most of the spec and fifofs vnops functions.
7. Make UFS panic if it lacks bits of the underlying storage handling.


30434 15-Oct-1997 phk

Hmm, realign the vnops into two columns.


30431 15-Oct-1997 phk

Stylistic overhaul of vnops tables.
1. Remove comment stating the blatantly obvious.
2. Align in two columns.
3. Sort all but the default element alphabetically.
4. Remove XXX comments pointing out entries not needed.


30428 15-Oct-1997 bde

IN_HASHED goes in the in-core flags ip->i_flag, not in the on-disk flags
ip->i_flags.

Rev.1.18 completely broke ufs. My root directory went away about 10
seconds after booting. I think file system damage was null, since
IN_HASHED = 0x80 is not used in the disk flags (it would probably
be UF_SOMETHING if it were used).


30419 14-Oct-1997 phk

Reset the flag right away, could catch a bogon someday.


30418 14-Oct-1997 phk

I think my previous change may have opened a race condition.
This patch does the same thing, with no change in semantics.


30402 14-Oct-1997 phk

ufs_ihashrem() should not be called from the UFS layer, but from the
lower layer (LFS/FFS/?) like the rest of the ihash functions.
Otherwise it is impossible to make a lower layer that doesn't use the
ihash facility.


30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of something :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


30309 11-Oct-1997 phk

Distribute and staticize a lot of the malloc M_* types.

Substantial input from: bde


30285 10-Oct-1997 phk

Make ufs_reclaim free the underlying inode.


30284 10-Oct-1997 phk

Use generic ufs_reclaim().


30283 10-Oct-1997 phk

Add type arg to ffs_mountfs and avoid examining v_tag to find out
if MFS is getting a free ride.

Use generic ufs_reclaim().


29888 27-Sep-1997 kato

Clustered read and write are switched at mount-option level.

1. Clustered I/O is switched by the MNT_NOCLUSTERR and MNT_NOCLUSTERW
bits of the mnt_flag. The sysctl variables, vfs.foo.doclusterread
and vfs.foo.doclusterwrite are deleted. Only mount option can
control clustered I/O from userland.
2. When foofs_mount mounts a block device, foofs_mount checks the D_NOCLUSTERR
and D_NOCLUSTERW bits of the d_flags member in the block device switch
table. If D_NOCLUSTERR / D_NOCLUSTERW are set, MNT_NOCLUSTERR /
MNT_NOCLUSTERW bits will be set. In this case, MNT_NOCLUSTERR and
MNT_NOCLUSTERW cannot be cleared from userland.
3. Vnode driver disables both clustered read and write.
4. Union filesystem disables clustered write.

Reviewed by: bde


29725 22-Sep-1997 joerg

Make MFS a supported option, finally.


29685 21-Sep-1997 gibbs

Convert tqdisksort to bufqdisksort. Honor the B_ORDERED buffer flag
so that meta-data writes go out to the device in the right order.


29684 21-Sep-1997 gibbs

Update for new buffer queue data structure.


29653 21-Sep-1997 dyson

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the usage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


29609 19-Sep-1997 phk

[Regarding the previous patch] This is completely wrong.

1. ffs_alloc() actually allowed writing one block less one frag (normally
7 frags or 7/8 blocks) beyond the limit.
2. freebufspace() gives the free space in frags, but `size' is in bytes,
so the change results in approximately `size' fragments too many being
reserved.
3. ffs_realloccg() has the same bug but wasn't changed.
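
The unit mismatch in point 2 can be illustrated with a small sketch. It is
not the ffs_alloc() code; the function names and the simplified "free
minus request must stay at or above the reserve" test are assumptions made
only to show what happens when a byte count is subtracted from a fragment
count.

```c
#include <assert.h>

/* Hypothetical helper: round a byte count up to whole fragments. */
static long
bytes_to_frags(long size, long fsize)
{
	assert(size >= 0 && fsize > 0);
	return ((size + fsize - 1) / fsize);
}

/* Consistent units: both sides of the comparison are in fragments. */
static int
has_room(long free_frags, long reserved_frags, long size, long fsize)
{
	return (free_frags - bytes_to_frags(size, fsize) >= reserved_frags);
}

/* Mixed units: subtracts a byte count from a fragment count, so roughly
 * `size' fragments too many are treated as needed. */
static int
has_room_mixed_units(long free_frags, long reserved_frags, long size)
{
	return (free_frags - size >= reserved_frags);
}
```

With 100 free fragments, a reserve of 90, 1024-byte fragments and an
8192-byte request, the consistent check sees 8 fragments needed and
succeeds, while the mixed-unit check behaves as if 8192 fragments were
needed and fails.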

PR: 3398
Submitted by: bde
Eyeballed by: phk


29581 18-Sep-1997 phk

ffs_alloc() allowed users to write one block beyond the limit.

PR: 3398
Reviewed by: phk
Submitted by: Wolfram Schneider <wosch@apfel.de>


29362 14-Sep-1997 peter

Convert select -> poll.
Delete 'always succeed' select/poll handlers, replaced with generic call.
Flag missing vnode op table entries.


29287 10-Sep-1997 phk

Update the comment and remove checks now done centrally.


29208 07-Sep-1997 bde

Removed yet more vestiges of config-time swap configuration and/or
cleaned up nearby cruft.


29041 02-Sep-1997 bde

Removed unused #includes.


28954 31-Aug-1997 phk

Change the 0xdeadb hack to a flag called VDOOMED.
Introduce VFREE which indicates that vnode is on freelist.
Rename vholdrele() to vdrop().
Create vfree() and vbusy() to add/delete vnode from freelist.
Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0)
vnodes off the freelist.
Generalize vhold()/v_holdcnt to mean "do not recycle".
Fix reassignbuf()s lack of use of vhold().
Use vhold() instead of checking v_cache_src list.
Remove vtouch(), the vnodes are always vget'ed soon enough
after for it to have any measurable effect.
Add sysctl debug.freevnodes to keep track of things.
Move cache_purge() up in getnewvnodes to avoid race.
Decrement v_usecount after VOP_INACTIVE(), put a vhold() on
it during VOP_INACTIVE()
Unmacroize vhold()/vdrop()
Print out VDOOMED and VFREE flags (XXX: should use %b)

Reviewed by: dyson


28787 26-Aug-1997 phk

Uncut&paste cache_lookup().

This unifies several copies of 50 lines of code that were, in theory, identical.

The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.

It's still the task of the individual filesystems to populate the
namecache with cache_enter().

Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.
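
The division of labour described above can be sketched with a toy model.
Nothing here is the kernel interface: the toy_* names, the fixed-size
cache, and the fake inode numbers are all hypothetical, and real vnode
operations take very different arguments. The sketch only shows the
control flow: a generic front end consults the cache, and on a miss it
calls the filesystem's own lookup, which is also responsible for entering
the result into the cache.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS   4
#define TOY_NAMELEN 32

/* Toy stand-in for the name cache. */
struct toy_cache {
	char name[TOY_SLOTS][TOY_NAMELEN];
	int  ino[TOY_SLOTS];
	int  used;
};

/* Toy stand-in for a filesystem's vop_cachedlookup method. */
typedef int (*toy_cachedlookup_t)(const char *, struct toy_cache *);

static int toy_fs_calls;	/* counts how often the slow path runs */

static int
toy_cache_find(const struct toy_cache *c, const char *name)
{
	int i;

	for (i = 0; i < c->used; i++)
		if (strcmp(c->name[i], name) == 0)
			return (c->ino[i]);
	return (-1);
}

/* The filesystem, not the generic layer, populates the cache. */
static void
toy_cache_enter(struct toy_cache *c, const char *name, int ino)
{
	if (c->used < TOY_SLOTS && strlen(name) < TOY_NAMELEN) {
		strcpy(c->name[c->used], name);
		c->ino[c->used] = ino;
		c->used++;
	}
}

/* Per-filesystem "meat of the lookup"; the inode number is made up. */
static int
toy_fs_cachedlookup(const char *name, struct toy_cache *c)
{
	int ino = 100 + (int)strlen(name);

	toy_fs_calls++;
	toy_cache_enter(c, name, ino);
	return (ino);
}

/* Generic front end: check the cache, pass on to the filesystem on a miss. */
static int
toy_vfs_cache_lookup(const char *name, struct toy_cache *c,
    toy_cachedlookup_t miss)
{
	int ino = toy_cache_find(c, name);

	assert(miss != NULL);
	return (ino >= 0 ? ino : miss(name, c));
}
```

Looking up the same name twice through toy_vfs_cache_lookup() runs
toy_fs_cachedlookup() only once; the second call is answered from the
cache.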


28774 26-Aug-1997 dyson

Back out some incorrect changes that were worse than the original bug.


28701 25-Aug-1997 kato

Renamed doclusterread/write to unique names (ffs_doclusterread/write),
and staticize them. Move the #include of <sys/sysctl.h> to the top of
the file.

Pointed out by: Bruce Evans <bde@zeta.org.au>


28598 22-Aug-1997 dyson

Fix the "remove optimization" by removing it. Sorry for the trouble.


28558 22-Aug-1997 dyson

This is a trial improvement for the vnode reference count while on the vnode
free list problem. Also, the vnode age flag is no longer used by the
vnode pager. (It is actually incorrect to use then.) Constructive
feedback welcome -- just be kind.


28466 21-Aug-1997 dyson

Performance improvement to minimize delayed write output of files
that have been deleted.
Submitted by: Peter M. Chen <pmchen@eecs.umich.edu>


28270 16-Aug-1997 wollman

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to now malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


27890 04-Aug-1997 phk

We got a couple of "map mismatch" panics from the following
code. According to the crash dump, bpref is set to 445
and cgp->cg_nclusterblks is 444. Hence in the for loop,
the test fails immediately but the following failure check
(got == cgp->cg_nclusterblks) doesn't trigger because got >
cgp->cg_nclusterblks. This wreaks havoc in the code after that.

Fix: Move one source bit to the left :-)
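
The failure mode can be reproduced in a few lines. The log does not show
the patch itself; reading "move one source bit to the left" as changing
`==' into `>=' is an assumption (the ASCII codes for '=' and '>' are 0x3D
and 0x3E, which differ by exactly such a bit move). The function names
below are hypothetical and the loop body is reduced to nothing, since only
the loop bounds matter here.

```c
#include <assert.h>

/* Scan from bpref up to the end of the map; returns where the scan stopped.
 * A real scan would test a cluster map bit on each iteration. */
static int
scan_end(int bpref, int nclusterblks)
{
	int got;

	assert(bpref >= 0 && nclusterblks >= 0);
	for (got = bpref; got < nclusterblks; got++)
		continue;
	return (got);
}

/* Failure check as described in the entry: misses got > nclusterblks. */
static int
scan_failed_eq(int bpref, int nclusterblks)
{
	return (scan_end(bpref, nclusterblks) == nclusterblks);
}

/* Assumed fix: also treats a start beyond the end of the map as failure. */
static int
scan_failed_ge(int bpref, int nclusterblks)
{
	return (scan_end(bpref, nclusterblks) >= nclusterblks);
}
```

With the values from the crash dump (bpref 445, cg_nclusterblks 444) the
equality check reports no failure, while the `>=' form does.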

Noticed by: Mike Hibler <mike@fast.cs.utah.edu>
Submitted by: Kirk McKusick <mckusick@McKusick.COM>


27845 02-Aug-1997 bde

Removed unused #includes.


27378 13-Jul-1997 bde

Always mark st_ctime for update upon successful completion of
chown(). Previously, it wasn't marked for null chown()'s. We
permit null chown()s as a special case of "appropriate privilege"
- everyone has enough privilege to not change ids (this is a better
argument than the one I gave for rev.1.13, that null changes aren't
really changes). However, POSIX.1 requires the update independently
of whether anything has changed.

Clear both the setuid and the setgid bits upon successful completion
of non-null chown()s by non-root. Previously, the setuid bit was
only changed for non-null changes of the uid, etc. POSIX.1 requires
clearing both unless the call was made by a process with "appropriate
privilege", in which case altering the bits is implementation-defined.
We define appropriate privilege as `process is root, or the change
is null', and the implementation-defined behaviour as not altering
the bits. There is no interpretation that permits clearing only
one of the bits.
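
The rule stated above for the setuid and setgid bits can be written down
as a small decision function. This is a sketch of the policy as the entry
describes it, not the ufs_chown() code; the function name, its arguments,
and the TOY_* constants (which mirror the conventional octal values of
S_ISUID and S_ISGID) are chosen only for illustration.

```c
#include <assert.h>

#define TOY_ISUID 04000u	/* conventional value of S_ISUID */
#define TOY_ISGID 02000u	/* conventional value of S_ISGID */

/*
 * Return the permission bits after a successful chown().
 * "Appropriate privilege" is taken to mean: the caller is root, or the
 * change is null. In that case the bits are left alone; otherwise both
 * the setuid and the setgid bit are cleared, never just one of them.
 */
static unsigned
mode_after_chown(unsigned mode, int is_root, int uid_changes, int gid_changes)
{
	int null_change = !uid_changes && !gid_changes;

	assert((mode & ~07777u) == 0);
	if (is_root || null_change)
		return (mode);
	return (mode & ~(TOY_ISUID | TOY_ISGID));
}
```

For example, a non-root change of only the group on a mode 04755 file
still clears the setuid bit, giving 0755, while a null change leaves
06755 untouched.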

Reviewed by: jdp


27377 13-Jul-1997 bde

Use the correct size for a sector in the search for a label in
readdisklabel(). Sectors may be larger than DEV_BSIZE.


27376 13-Jul-1997 bde

Removed semicolon from the end of a #define.


27375 13-Jul-1997 bde

Fixed comment about i_spare.


26664 15-Jun-1997 dyson

Fix a problem with the VN device. Specifically, the VN device can
cause a problem of spiraling death due to buffer resource limitations.
The vfs_bio code in general had little ability to handle buffer resource
management, and now it does. Also, there are a lot more knobs for tuning the
vfs_bio code now. The knobs came free because of the need that there
always be some immediately available buffers (non-delayed or locked) for
use. Note that the buffer cache code is much less likely to get bogged
down with lots of delayed writes, even more so than before.


26360 02-Jun-1997 julian

Submitted by: Whistle Communications (Archie Cobbs)

These changes add the ability to specify that a UFS file/directory
cannot be unlinked. This is basically a scaled back version
of the IMMUTABLE flag. The reason is to allow an administrator
to create a directory hierarchy that a group of users
can arbitrarily add/delete files from, but that the hierarchy
itself is safe from removal by them.
If the NOUNLINK definition is set to 0,
this results in no change to what happens normally
(and in an identical kernel binary).
It can be shown that if this bit is never set by the admin,
no new behaviour is introduced.
Several "good idea" comments from reviewers plus one grumble
about creeping featurism.

This code is in production in 2.2 based systems


26112 25-May-1997 peter

Fix warnings (from LINT). Missing static prototype, missing vm includes
for vnode_pager_setsize().


26001 22-May-1997 phk

Shrink struct inode by 20 bytes, so that malloc wastes less space.

Pointed out by: bde


25877 17-May-1997 phk

Remove redundant check for vp == dvp (done in VFS before calling).


25244 28-Apr-1997 jkh

Mount MFS read/write as in days of yore.


24775 10-Apr-1997 bde

Use smalllblktosize() instead of multiplying small block numbers
by fs->fs_bsize. The macro is usually faster and makes it clearer
that the multiplication can't overflow.


24477 01-Apr-1997 bde

Removed nested include of <ufs/ufs/dir.h>. Use the pre-Lite2 hack of
defining doff_t both here and in <ufs/ufs/dir.h> so that this file
is independent of <ufs/ufs/dir.h>. It still has old prerequisites
<sys/param.h> and <ufs/ufs/quota.h>, and a new Lite2 prerequisite of
<sys/lock.h>, sigh.

This might fix lsof, which was broken by namespace pollution giving
conflicting definitions of DIRBLKSIZ.


24438 31-Mar-1997 peter

Treat symlinks as first class citizens with their own uid/gid rather than
as shadows of their containing directory. This should solve the problem
of users not being able to delete their symlinks from /tmp once and for
all.

Symlinks do not have modes though; they are accessible to everything that
can read the directory (as before). They are made to show this fact at
lstat time (they appear as mode 0777 always, since that's how the
lookup routines in the kernel treat them).

More commits will follow, eg: add a real lchown() syscall and man pages.


24203 24-Mar-1997 bde

Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include
it when it is not used. In most cases, the reasons for including it
went away when the special ioctl headers became self-sufficient.


24171 24-Mar-1997 bde

Fixed corrupted newline and corrupted tab in previous commit.


24149 23-Mar-1997 guido

Add generation number randomization. Newly created filesystems will now
automatically have random generation numbers. The kernel way of handling those
also changed. Further, it is advised to run fsirand on all your NFS exported
filesystems. The code is mostly copied from OpenBSD, with the randomization
changed to use /dev/urandom.
Reviewed by: Garrett
Obtained from: OpenBSD


24131 23-Mar-1997 bde

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


24128 23-Mar-1997 bde

Merged the rest of lfs from Lite2. It compiles (uncleanly) but is as
unlikely to work as before.


24103 22-Mar-1997 bde

Merged enough of lfs from Lite2 for mkdep of LINT to work again.


24102 22-Mar-1997 bde

Removed `volatile' from declaration of `time', and removed the resulting
null casts. `time' is nonvolatile for accesses within a region locked
by splclock()/splx(). Accesses outside such a region are invalid, and
splx() must have the side effect of potentially changing all global
variables (since there are hundreds of sort of volatile variables like
`time'), so declaring `time' as volatile didn't have any real benefits.


24101 22-Mar-1997 bde

Fixed some invalid (non-atomic) accesses to `time', mostly ones of the
form `tv = time'. Use a new function gettime(). The current version
just forces atomicity without fixing precision or efficiency bugs.
Simplified some related valid accesses by using the central function.


24098 22-Mar-1997 bde

Backed out rev.1.27, which broke unmounting of mfs and caused panics
on shutdown.

Should not have been in 2.2 (the buggy last minute change, that is).


23998 18-Mar-1997 peter

MAXDIRSIZE is (or would be) used in fsck. It's a sanity check.


23997 18-Mar-1997 peter

Restore the lost MNT_LOCAL flag twiddle. Lite2 has a different mechanism
of setting it (compiled into vfs_conf.c), but we have a dynamic system
in place. This could probably be better done via a runtime configure
flag in the VFS_SET() VFS declaration, perhaps VFCF_LOCAL, and have the
VFS code propagate this down into MNT_LOCAL at mount time. The other FS's
would need to be updated; having UFS and MSDOSFS filesystems without
MNT_LOCAL breaks a few things. The man page rebuild scans for local
filesystems and currently fails; I suspect that other tools like find
and tar with their "local filesystem only" modes might be affected.


23908 15-Mar-1997 sos

Fix support for != 512 byte sector devices.
Restores the use of SBLOCK instead of the BSOFF/sectorsize calculation.
Using SBLOCK is bogus however in that it uses DEV_BSIZE instead of
the actual sector size, but that is taken care of in other places.
Changing the SBLOCK would be better, but it affects the system
in other places, and doing it this way makes it possible to
use filesystems that were made before the lite2 merge.


23562 09-Mar-1997 mpp

Update a number of routines to reflect the actual name
of the routine that caused the panic.


23560 09-Mar-1997 mpp

Update a number of panic messages to reflect the actual name
of the routine that caused the panic.


23388 05-Mar-1997 msmith

Supply the mount point given to mfs_mount when getting a vnode for the
mount. This may have been a contributor to the 'null v_mount in
fsync()' problem

This is another, perhaps slightly less urgent, 2.2 last-minute candidate.

Reviewed by: sef


23383 04-Mar-1997 bde

Fixed connection of vfs.ffs node to the sysctl tree.


23347 03-Mar-1997 bde

Removed unused flag IN_RECURSE and unused struct member i_lockcount.


23346 03-Mar-1997 bde

Removed useless setting of IN_RECURSE. The (anti) locking for this needs
to be done in a different way, if at all.


22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


22881 18-Feb-1997 bde

This now uses queue macros. Include <sys/queue.h> if !KERNEL to preserve
the documented interface.


22619 13-Feb-1997 bde

Removed FIFO ifdef again (see rev.1.5).


22579 12-Feb-1997 mpp

Add function prototypes for most of the new Lite2 functions.
Also made a few of the miscfs routines static to be
consistent. Some modules simply required some additional
#includes to remove -Wall warnings.


22544 10-Feb-1997 mpp

Correct the new Lite2 #ifdef DIAGNOSTIC ffs_checkblk routine
to not return without setting a return value when it
can't read a block or detects a bad cylinder group,
since the caller is expecting a return value.
It will now panic at this point, since the alternative
in this case would be to return a "bad block"
status to the caller, and the caller will panic
anyway when that happens.

Also updated the panic strings in this routine to read
"ffs_checkblk: ..." instead of "checkblk: ...".


22540 10-Feb-1997 mpp

Make this compile after the Lite2 merge.

A non-existent variable was being used.


22539 10-Feb-1997 mpp

Make ffs_subr.c compile when DIAGNOSTIC is defined.
It looks like this was broken before the Lite2 merge :-(.
VOP_BMAP was being called with the wrong number of arguments.


22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


21002 29-Dec-1996 dyson

This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE THAT LKMS MUST BE REBUILT!!!


20311 11-Dec-1996 dyson

Significant performance improvement for mmap'ed files. This commit
makes MADV_SEQUENTIAL much more effective. I suggest that
we start using MADV_SEQUENTIAL on system utilities that mmap
their input files and whose I/O is predominantly sequential.
Below is a test with 'cmp' on two relatively large binary files,
where the files are so large that the caching is ineffective:

+ ls -l t1.xxx t2.xxx
-rw-r--r-- 1 root wheel 65598384 Dec 10 12:13 t1.xxx
-rw-r--r-- 1 root wheel 65598384 Dec 10 12:14 t2.xxx

+ time cmp t1.xxx t2.xxx
3.78user 0.70system 1:33.43elapsed 4%CPU

+ time cmpmadv t1.xxx t2.xxx
4.21user 1.05system 0:30.93elapsed 17%CPU

This change is as a result of an observation made by BDE.
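
A utility that mmaps its input could apply the advice roughly as sketched
below. This is an illustration, not the cmpmadv program from the test above:
sum_file_sequential() is a hypothetical helper, and the _DEFAULT_SOURCE
define is only an assumption about what glibc needs to expose madvise().

```c
#define _DEFAULT_SOURCE		/* for madvise() on glibc; harmless elsewhere */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: sum the bytes of a file through a read-only
 * mapping that is advised as sequential.  Returns -1 on error or for
 * an empty file.
 */
static long
sum_file_sequential(const char *path)
{
	struct stat st;
	unsigned char *p;
	long sum = 0;
	off_t i;
	int fd;

	assert(path != NULL);
	if ((fd = open(path, O_RDONLY)) == -1)
		return (-1);
	if (fstat(fd, &st) == -1 || st.st_size == 0) {
		close(fd);
		return (-1);
	}
	p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);		/* the mapping stays valid after close */
	if (p == MAP_FAILED)
		return (-1);
	/* Advisory only: ignore failure and keep the default paging. */
	(void)madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);
	for (i = 0; i < st.st_size; i++)
		sum += p[i];
	munmap(p, (size_t)st.st_size);
	return (sum);
}
```

The advice changes only paging behaviour, so the result is the same with or
without the madvise() call.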


20070 01-Dec-1996 bde

Removed all references to b_cylinder (aka b_cylin). It was evil and
hasn't been used for a year or two since disksort() started sorting
on b_pblkno.


20061 01-Dec-1996 sos

This update adds the support for != 512 byte sector SCSI devices to
the sd & od drivers. There are also slight changes to fdisk & newfs
in order to comply with different sectorsizes.
Currently sectors of size 512, 1024 & 2048 are supported, the only
restriction being in fdisk, which hunts for the sectorsize of
the device.
This is based on patches to od.c and the other system files by
John Gumb & Barry Scott, minor changes and the sd.c patches by
me.
There also exist some patches for the msdos filesys code, but I
haven't been able to test those (yet).

John Gumb (john@talisker.demon.co.uk)
Barry Scott (barry@scottb.demon.co.uk)


19700 13-Nov-1996 julian

Submitted by: Archie and me.

We encountered an interesting situation where the superblock for
a file system got written to disk with the "fs_fmod" flag set to
one. It appears that this flag is normally supposed to be cleared
during ffs_sync(), but we experienced a crash, or some other weird
occurrence that left it on the disk set to 1.

Later this partition was mounted read-only... and the fs_fmod
field was never cleared, causing ffs_sync() to panic "rofs mod"
when trying to unmount that filesystem (ffs_vfsops.c: line 790).

fix:
set this bit to 0 when you load the superblock from disk.
(see more complete mail on this to hackers)
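
The fix can be sketched as below. This is a reduced illustration, not the
ffs_vfsops.c code: struct mini_fs and load_superblock() are hypothetical, and
only the field names fs_fmod and fs_ronly are borrowed from the real
superblock.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily reduced superblock. */
struct mini_fs {
	int	fs_ronly;	/* mounted read-only */
	int	fs_fmod;	/* superblock modified flag */
};

/*
 * Copy an on-disk superblock image into its in-core form and discard
 * any stale "modified" state left behind by a crash.
 */
static void
load_superblock(struct mini_fs *incore, const struct mini_fs *ondisk,
    int ronly)
{
	assert(incore != NULL && ondisk != NULL);
	memcpy(incore, ondisk, sizeof(*incore));
	incore->fs_ronly = ronly;
	incore->fs_fmod = 0;
}
```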


19424 05-Nov-1996 dg

Eliminate an unnecessary synchronous write (and an 8K bcopy+bzero) when
truncating/deleting large files.

Reviewed by: mckusick, dyson
Submitted by: Kirk McKusick <mckusick@mckusick.com>, modified for
FreeBSD by me.


19403 04-Nov-1996 hsu

struct mfsnode bloated in size by 12 bytes, so reduce spare padding by 3 longs.
We now only have 4 spare bytes before hitting the dreaded 32 byte threshold.


19388 04-Nov-1996 bde

Fixed some races and misleading comments in ufs_rename().

1. When a directory is renamed to an existing (empty) directory,
it is possible for the target vnode to become the source vnode
underneath you (because another process may complete the same
rename). It was assumed that this can't happen, and the bogus
errno EINVAL was returned. This was fairly harmless.

Fix: return ENOENT instead, as if the source directory was renamed
a little earlier.

2. The same metamorphosis is possible for non-directories. It was
assumed that this can't happen, and the code for handling "just
removing a link name" happened to be used. This would have worked
except for fatal bugs in the link name removal - the link name was
assumed to still be there, and a null pointer was followed.

Fix: check the result of relookup(). This fixes PR 1930.

Notes:

(a) POSIX seems to say that removing link names shall have no effect.
BSD (4.4Lite2 at least) does something reasonable instead.

(b) The relookup() may find a file unrelated to the original.
Removing this isn't correct. Consider 3 existing files A, B and
C, and concurrent renames: AB = rename(A, B), another AB, and
CA = rename(C, A). If rename() is atomic, then only the
following results are possible:

AB, AB (fails), CA: A = original C, B = original A, C = gone
AB, CA, AB: A = gone, B = original C, C = gone
CA, AB, AB (fails): A = gone, B = original C, C = gone

but ufs_rename() can give:

A,AB,CA,B (sorta): A = gone, B = original A, C = gone

This usually doesn't matter, since getting into a race is usually
an error.
---

These fixes should be in 2.1.6 and 2.2.


18899 12-Oct-1996 bde

Fixed lblktosize(). It overflowed at 2G. This bug only affected
ufs_read() and ufs_write().

Found by: looking at warnings for comparing the result of lblktosize()
(which is usually daddr_t = long) with file sizes (which are u_quad_t
for ufs). File sizes should probably be off_t's to avoid warnings
when they are compared with file offsets, so the fixed lblktosize()
casts to off_t instead of u_quad_t.

Added definition of smalllblktosize(). It is the same as the old
lblktosize() and is more efficient for small block numbers on 32-bit
machines.

Use smalllblktosize() instead of its expansion in blksize() and
dblksize(). This keeps the line length short and makes it more
obvious that the shift can't overflow.
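
The difference between the two conversions can be sketched as follows. The
functions are hypothetical stand-ins for the macros, taking the block shift
as a parameter instead of an fs pointer. The sketch uses unsigned 32-bit
arithmetic so that the wraparound is well defined, which means it wraps at
4G rather than at the 2G limit of the signed type mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit form: cheap, but only valid while the byte offset fits. */
static uint32_t
small_lblktosize(unsigned bshift, uint32_t blk)
{
	assert(bshift < 32);
	return (blk << bshift);
}

/* Widened form: convert first, then shift in 64 bits. */
static int64_t
wide_lblktosize(unsigned bshift, uint32_t blk)
{
	assert(bshift < 32);
	return ((int64_t)blk << bshift);
}
```

With 8K blocks (a shift of 13), block 524288 is where the 32-bit form in
this sketch wraps to 0 while the widened form still gives the right offset.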


18429 20-Sep-1996 bde

Don't include <sys/conf.h> for the kernel in disk-related headers.
It is needed for implementation details but very little of it is
needed for the interface. Include it in the few places that didn't
already include it.

Include <sys/ioccom.h> in <sys/disklabel.h> (as already in
<sys/diskslice.h>) so that all the disk-related headers are almost
self-sufficient.


18413 20-Sep-1996 nate

Whoops, I should've used the LINT config file. More ts -> tv changes
for timespec structure.


18397 19-Sep-1996 nate

In sys/time.h, struct timespec is defined as:

/*
* Structure defined by POSIX.4 to be like a timeval.
*/
struct timespec {
time_t ts_sec; /* seconds */
long ts_nsec; /* and nanoseconds */
};

The correct names of the fields are tv_sec and tv_nsec.

Reminded by: James Drobina <jdrobina@infinet.com>
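
For reference, code written against the corrected member names looks like
the sketch below. ms_to_timespec() is a hypothetical helper; struct timespec
itself comes from <time.h>.

```c
#include <assert.h>
#include <time.h>

/* Convert milliseconds to a timespec using the tv_ member names. */
static struct timespec
ms_to_timespec(long ms)
{
	struct timespec ts;

	assert(ms >= 0);
	ts.tv_sec = (time_t)(ms / 1000);
	ts.tv_nsec = (ms % 1000) * 1000000L;
	return (ts);
}
```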


18330 17-Sep-1996 peter

Argh, I have had one "uid 0 on /: file system full" too many. The problem
is that it doesn't say _what_ did it! (the core dumped console message
is very useful for listing the process name and pid). This adds similar
information.


18104 07-Sep-1996 dyson

Fix a VOP_UNLOCK panic when using options DIAGNOSTIC during dismount.


18069 06-Sep-1996 gibbs

Use bowrite instead of VOP_BWRITE in a few cases. This can probably be taken
further.


18020 03-Sep-1996 bde

Eliminated nested include of <sys/unistd.h> in <sys/file.h> in the kernel.
Include it directly in the few places where it is used.

Reduced some #includes of <sys/file.h> to #includes of <sys/fcntl.h> or
nothing.


18006 03-Sep-1996 dg

Implemented kernel side of MNT_NOATIME mount option. This option disables
the file access time update on reads and can be useful in reducing
filesystem overhead in cases where the access time is not important (like
Usenet news spools).


17971 31-Aug-1996 bde

Don't depend in the kernel on the gcc feature of doing arithmetic on
pointers of type `void *'. Warn about this in future.


17761 21-Aug-1996 dyson

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistent
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


17108 12-Jul-1996 bde

Don't use NULL in non-pointer contexts.


17040 09-Jul-1996 wollman

Quiet a couple of -Wunused warnings.


16681 25-Jun-1996 dg

Fixed end condition for clustered reads.

Submitted by: Kirk McKusick via Lite-2 and email


16322 12-Jun-1996 gpalmer

Clean up -Wunused warnings.

Reviewed by: bde


16312 12-Jun-1996 dg

Moved the fsnode MALLOC to before the call to getnewvnode() so that the
process won't possibly block before filling in the fsnode pointer (v_data)
which might be dereferenced during a sync since the vnode is put on the
mnt_vnodelist by getnewvnode.

Pointed out by Matt Day <mday@artisoft.com>


15680 08-May-1996 gpalmer

Clean up various compiler warnings. Most (if not all) were benign

Reviewed by: bde


15576 03-May-1996 phk

disksort() is gone, all drivers now use tqdisksort().


15543 02-May-1996 phk

removed:
CLBYTES PD_SHIFT PGSHIFT NBPG PGOFSET CLSIZELOG2 CLSIZE pdei()
ptei() kvtopte() ptetov() ispt() ptetoav() &c &c
new:
NPDEPG

Major macro cleanup.


15493 01-May-1996 bde

Removed bogus __BEGIN_DECLS/__END_DECLS.

Removed unused struct tag declarations in cloned code.

Added or cleaned up idempotency ifdefs.


15315 19-Apr-1996 bde

Yet more b_flags fixes. The previous ones broke the clearing of B_DONE
and B_READ before writing. This was fatal. They also broke the
clearing of B_INVAL before doing i/o. This didn't actually matter.

Submitted by: mostly by joerg


15140 08-Apr-1996 phk

Replace usage of buf->b_actf by queue.3 and buf->b_act


14909 29-Mar-1996 bde

Fixed reference counting related to relookup(). relookup() must
be called with the directory referenced, and this reference will
be dropped iff relookup() fails, so the value returned must not be
ignored.

Reviewed by: davidg


14831 27-Mar-1996 hsu

Make type compatible with Lite2.
Submitted by: bde


14345 02-Mar-1996 dyson

Handle the bogus device that MFS uses as its VBLK device. We now don't
try to VMIO open it on MFS mounts. This will fix the mfs_badops
panic.


14317 02-Mar-1996 dyson

Enable VMIO for non-VDIR metadata and block device.


14315 02-Mar-1996 dyson

More b_flags fixes.


14312 01-Mar-1996 dyson

Fix a bug where b_flags was getting unnecessarily modified by
the slice code. The effect up to now has been insignificant, but
improved buffer allocation code will break with this problem.


14279 27-Feb-1996 mpp

Add a prototype for the quotactl system call.


14249 25-Feb-1996 bde

Removed vestigial support for the obsolete FIFO option. In ext2fs
it caused null pointer panics for all fifo operations unless FIFO
was defined.


13765 30-Jan-1996 mpp

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


13424 14-Jan-1996 bde

Partially fixed negative and truncated "Avail" counts in df output.
This fixes PR943.

ffs/ffs_vfsops.c:
ffs_statfs() multiplied by (100 - minfree) as part of calculating the
minfree percentage (complemented in 100%), so with the standard minfree
of 8, it was broken for file systems of size >= 1TB/92 = 11GB. Use the
standard freespace() macro instead. This also fixes a rounding bug (the
"Avail" count was sometimes 1 too small).

ffs/* (not fixed):
The freespace() macro multiplies by minfree, so with the standard
minfree of 8, it is broken for file systems of size >= 1TB/8 = 128GB.
This bug is more serious since it affects block allocation.

ffs/ffs_alloc.c (not fixed):
Ordinary users are sometimes allowed to allocate 1 (partial) block
too many so that the "Avail" count goes negative. E.g., if there is
1 fragment available and the file is fairly large, one more full
block is allocated.

df/df.c:
ufs_df() used/uses essentially the same code as ffs_statfs(), so it
had/has the same bugs.

ufs_df() gratuitously replaced "Avail" counts of < 0 by 0, so it
gave different results for non-mounted file systems in this case.


13309 07-Jan-1996 phk

The second cast wasn't needed.
Submitted by: bde


13273 06-Jan-1996 phk

Fix the asami&phk bug. This was a sign-extension bug, where a long
got multiplied by a constant before being upgraded to long long.
This should fix kern/104 and possibly kern/105.
Thanks to: dyson & asami.
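
The class of bug described here can be sketched as follows. The function
names are hypothetical and the operands are arbitrary; the actual expression
that was fixed is not shown in this log.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy shape: the product is formed in 32 bits and only the result is
 * widened.  Unsigned arithmetic is used here so the wraparound is well
 * defined; converting the wrapped value back to int32_t is
 * implementation-defined but yields a negative number on common
 * two's-complement compilers.
 */
static int64_t
scale_narrow(int32_t v, int32_t k)
{
	assert(v >= 0 && k >= 0);
	return ((int64_t)(int32_t)((uint32_t)v * (uint32_t)k));
}

/* Fixed shape: widen one operand first, so the multiply is 64-bit. */
static int64_t
scale_wide(int32_t v, int32_t k)
{
	assert(v >= 0 && k >= 0);
	return ((int64_t)v * k);
}
```

Both forms agree while the product fits in 32 bits; they diverge as soon as
it reaches 2^31, for example with 65536 * 32768.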


13260 05-Jan-1996 wollman

Convert QUOTA to new-style option.


13228 04-Jan-1996 wollman

Convert DDB to new-style option.


13122 30-Dec-1995 peter

recording cvs-1.6 file death


12976 22-Dec-1995 bde

Fixed prototyping and staticizing for -DDEBUG case.


12971 22-Dec-1995 phk

Staticize.


12911 17-Dec-1995 phk

Staticize.


12861 15-Dec-1995 peter

Silence a harmless warning...


12838 14-Dec-1995 bde

Included <sys/conf.h> and updated to indirect devswitches so that
this compiles again, and added a prototype.


12825 14-Dec-1995 peter

*hack alert*! :-) This adds an option to the MFS_ROOT code so that it
is possible to boot a kernel with an empty in-core MFS image, and have
it load the image from floppy directly. This is admittedly a hack and
would be better replaced by a self-loading ram-disk.


12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


12653 06-Dec-1995 bde

Fixed compilation of lfs utilities which I broke the other day by
#including lfs_extern.h and goop to support it in lfs_conv.c.


12590 03-Dec-1995 bde

Completed function declarations and/or added prototypes and/or #includes
to get the prototypes.


12500 28-Nov-1995 bde

Removed bogus __BEGIN_DECLS/__END_DECLS.


12499 28-Nov-1995 peter

After having put on my Asbestos suit, complete the MFS_ROOT part of Terry's
mountroot changes. This moves the mfs_initminiroot functionality
into the root mfs_mount....


12497 28-Nov-1995 peter

Attempt to solve the busy-buffers-on-shutdown caused by MFS once and for all.

What was happening was that the main mfs loop was sleeping and was
awoken by a wakeup when it was supposed to process some IO requests.

The problem was that if it was being woken out of the tsleep() by a signal
at shutdown, it was going straight into dounmount() without servicing any
pending IO requests, causing dounmount() to fail because there were busy
buffers (and they could not be "processed" because the processing loop was
trying to unmount rather than dispatching into mfs_doio()).

This (dare I say it :-) appears to be a layering problem....


12460 23-Nov-1995 dyson

Update the wd.c driver to use the new TAILQ scheme for device
buffer queue. Also, create a new subroutine 'tqdisksort' that
is an improved version of the original disksort that also uses
TAILQs.


12453 21-Nov-1995 bde

Completed function declarations and/or added prototypes.


12424 20-Nov-1995 phk

Fix compiler warnings.


12405 19-Nov-1995 dyson

General fixes to the vfs clustering code:

1) Make cluster buffer list be a non-malloced chain. This eliminates
yet another 'evil' M_WAITOK and generally cleans up the code.
2) Fix write clustering for ext2fs. It was just broken. Also, ffs
clustering had an efficiency problem that more bawrites were happening
than should have been.
3) Make changes to buf.h to support the above, plus remove b_pfcent
at the request of David Greenman.

Note that the reallocblocks code is disabled pending rewrite for
the cluster buffer list changes.


12399 19-Nov-1995 dyson

Change incorrect '#if EXT2FS' to '#ifdef EXT2FS'


12288 14-Nov-1995 phk

Get rid of the last debug sysctl variables of the old style.


12221 12-Nov-1995 bde

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


12158 09-Nov-1995 bde

Introduced a type `vop_t' for vnode operation functions and used
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.

The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.


12120 06-Nov-1995 dyson

This commit causes UFS to perform at Linux EXT2FS metadata rates. After
earlier discussions with DG, and a recent email exchange with SEF, I
decided to allow UFS to run wide-open on an experimental basis. We
will probably eventually support multiple async modes, and this is
the fastest that we can expect. Just use the -o async flag on the
UFS mount. Good luck...


12117 05-Nov-1995 dyson

Changes to existing files for ext2fs support. The UFS mods need rework
in the future as they are a bit crufty -- but at least the stuff is in the
tree now.


12114 05-Nov-1995 dyson

Fix ufs_bmap so that triple indirect blocks might work.
Submitted by: Godmar Back <gback@facility.cs.utah.edu>


12111 05-Nov-1995 dyson

Make MNT_ASYNC more effective for UFS. It should not be too much more
dangerous than the original MNT_ASYNC. There might be some minor
security considerations due to data writes not being posted as promptly
as before. Meta-data operations are still not quite as fast as Linux,
but streaming I/O is still higher.


11953 31-Oct-1995 peter

mfs_open could panic with false identification: panic("mfs_ioctl: ....


11701 23-Oct-1995 dyson

Finalize GETPAGES layering scheme. Move the device GETPAGES
interface into specfs code. No need at this point to modify the
PUTPAGES stuff except in the layered-type (NULL/UNION) filesystems.


11644 22-Oct-1995 dg

Moved the filesystem read-only check out of the syscalls and into the
filesystem layer, as was done in lite-2. Merged in some other cosmetic
changes while I was at it. Rewrote most of msdosfs_access() to be more
like ufs_access() and to include the FS read-only check.

Obtained from: partially from 4.4BSD-lite2


11297 07-Oct-1995 bde

Return EINVAL instead of panicking for rename("dir1", "dir2/..").

Fixes part of PR 760.

This bug seems to be very old.


11264 06-Oct-1995 phk

use roundup2 to avoid a bunch of 64bit divides.
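
The saving comes from replacing a divide and multiply with a mask when the
alignment is a power of two. A sketch, with hypothetical x_ prefixed
functions standing in for the roundup() and roundup2() macros:

```c
#include <assert.h>
#include <stdint.h>

/* General form: works for any y > 0 but needs a divide. */
static uint64_t
x_roundup(uint64_t x, uint64_t y)
{
	assert(y > 0);
	return (((x + (y - 1)) / y) * y);
}

/* Power-of-two form: a mask replaces the divide and multiply. */
static uint64_t
x_roundup2(uint64_t x, uint64_t y)
{
	assert(y != 0 && (y & (y - 1)) == 0);	/* y must be a power of 2 */
	return ((x + (y - 1)) & ~(y - 1));
}
```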


10998 25-Sep-1995 dyson

Re-enable read clustering.


10949 22-Sep-1995 dg

Shit! I changed the wrong doclusterread! ...Thanks to Steven Wallace and
Poul-Henning for convincing me that I should look at my mistake! :-)


10946 22-Sep-1995 dg

Disable file read clustering until the bug(s) in vfs_cluster.c are fixed.
This should temporarily fix the sig 10/11 problems that people have been
having for the past 3 weeks.


10823 16-Sep-1995 bde

Remove transitory labelling code. Labels are now handled by essentially
the original 4.4lite code. Machine Specific Partitions are now handled
separately.


10675 11-Sep-1995 bde

Fix benign type mismatch in a call to VOP_BMAP().


10646 09-Sep-1995 julian

Obtained from: 4.4lite2
fix a change where a shortcut resulted in the wrong answer..

e.g.
touch a
touch b
mv a b
resulted in b being removed and a being moved to b

in the shortcut..
touch a
ln a b
mv a b
the wrong link was removed..
leaving a instead of b, giving a different result to when
both files were separate.


10632 08-Sep-1995 dg

Slight optimization for the standard case of rotdelay=0.


10597 07-Sep-1995 dyson

Correct a case in the ffs_getpages where a page is not found in
a sparse file and the page is zeroed but not set valid, clean.


10578 06-Sep-1995 dyson

Added indirect pointer for ffs_getpages, and added external declaration.


10577 06-Sep-1995 dyson

Added new ffs_getpages routine. It isn't optimized yet, but FFS
now does it's own getpage -- instead of using the default routine
in vnode_pager.c.


10552 04-Sep-1995 dyson

Correct prototype for ufs_bmaparray()


10551 04-Sep-1995 dyson

Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count
for VOP_BMAP. Updated affected filesystems...


10431 30-Aug-1995 bde

Declare vfs_mountroot() in the right place.


10389 28-Aug-1995 bde

Fix correct_writedisklabel() and writedisklabel(). Their setting of
bp->b_flags has been broken for many years:
a) they didn't set B_BUSY for doing i/o. This has been fatal since
1995/07/25 when biodone() started checking that B_BUSY is set.
b) they didn't set B_INVAL for releasing the buffer. This at best
just put a useless buffer in the LRU queue for a little while.

Fix a couple of spelling errors and complete a couple of function
pointer declarations.


10358 28-Aug-1995 julian

Reviewed by: julian with quick glances by bruce and others
Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry.
they are:
New low-level init code that supports loadable modules better
some cleanups in the namei code to help terry in 16-bit character support
some changes to the mount-root code to make it a little more
modular..

NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able
to test those cases..

certainly mounting root off disk still works just fine..
mfs should work but is untested. (tomorrow's task)

The low level init stuff includes a total rewrite of init_main.c
to make it possible for new modules to have an init phase by simply
adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can
be added to the kernel without editing any other files other than the
'files' file.


10269 25-Aug-1995 bde

Don't call VOP_UPDATE() with volatile timestamps.


10129 20-Aug-1995 dg

Fixed mfs reboot panic by never returning failure from mfs_start().

Obtained from: 4.4BSD-Lite2


10080 16-Aug-1995 bde

Make everything except the unsupported network sources compile cleanly
with -Wnested-externs.


10078 16-Aug-1995 dg

Honor -async mount option when doing the inode update.

Obtained from: 4.4BSD-Lite2


10027 11-Aug-1995 dg

Converted mountlist to a CIRCLEQ.

Partially obtained from: 4.4BSD-Lite2


9984 07-Aug-1995 dg

On closer inspection, it turns out that all of the callers of disksort
are already at splbio()...so back out the last change to disksort.


9982 07-Aug-1995 dg

Since buffers can be pulled off of the disk queue at interrupt time and
disksort is called at non-interrupt time and can be actively traversing
the list when that happens, there is a very small window of vulnerability.
Close it by protecting disksort with splbio().


9980 07-Aug-1995 dg

Use bdwrite() rather than brelse(). The cylinder group bitmap modification
is not preserved otherwise.
Note that this is a no-op in FreeBSD, however, as we have doreallocblks
disabled.

Submitted by: Kirk McKusick


9968 06-Aug-1995 dg

Removed redundant call to vm_object_page_clean: this is already handled
by vfs_msync().


9967 06-Aug-1995 dg

Removed redundant call to vm_object_page_clean - this is already done
in vfs_msync().


9886 04-Aug-1995 dg

Use the correct flags (IO_SYNC -> B_SYNC) when deciding to do a sync or
async write in the section that changes the filesize. The bug resulted
in the updates always being async.

Obtained from: 4.4BSD-Lite2
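
A minimal sketch of the class of bug described in r9886, assuming two separate flag namespaces (caller-level I/O flags and buffer-level flags). All names and bit values below are hypothetical and chosen only for illustration; they are not the kernel's definitions. Testing a buffer-flag word against a bit from the I/O-flag namespace never matches, so the write is always treated as async:

```c
#include <stdbool.h>

/* Hypothetical flag values; the real kernel constants differ. */
#define SK_IO_SYNC  0x04	/* caller-level I/O flag namespace */
#define SK_B_CLRBUF 0x01	/* buffer-level flag namespace */
#define SK_B_SYNC   0x02

/* Translate caller-level I/O flags into buffer-level flags. */
static int
sk_ioflags_to_bflags(int ioflags)
{
	int bflags = SK_B_CLRBUF;

	if (ioflags & SK_IO_SYNC)
		bflags |= SK_B_SYNC;
	return (bflags);
}

/* Buggy test: checks buffer flags against an I/O-namespace bit. */
static bool
sk_write_is_sync_buggy(int bflags)
{
	return ((bflags & SK_IO_SYNC) != 0);
}

/* Fixed test: checks buffer flags against the buffer-namespace bit. */
static bool
sk_write_is_sync_fixed(int bflags)
{
	return ((bflags & SK_B_SYNC) != 0);
}
```

With these illustrative values, a request carrying SK_IO_SYNC is reported as async by the buggy test and as sync by the fixed one.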


9842 01-Aug-1995 dg

Removed my special-case hack for VOP_LINK and fixed the problem with the
wrong vp's ops vector being used by changing the VOP_LINK's argument order.
The special-case hack doesn't go far enough and breaks the generic
bypass routine used in some non-leaf filesystems. Pointed out by Kirk
McKusick.


9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuration.


9618 21-Jul-1995 dg

Since ufs_ihashget can block, the lock must be checked for each time
the function returns. Also, moved lock into .bss and made minor cosmetic
changes.

Submitted by: Bruce Evans


9601 21-Jul-1995 dg

Implement a lock in ffs_vget to prevent a race condition where two processes
try to allocate the same inode/vnode, causing a duplicate.

Submitted by: Matt Dillon, slightly reworked by me.


9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although it will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Their marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and
maintainability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


9356 28-Jun-1995 dg

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


9354 28-Jun-1995 dg

Fixed VOP_LINK argument order botch.


8876 30-May-1995 rgrimes

Remove trailing whitespace.


8831 29-May-1995 phk

Mount MFS as root RW. Remounting doesn't make sense.

Reviewed by: davidg


8805 28-May-1995 dg

Kill bogus vnode_pager_setsize(). It was being called at the wrong time
and resulted in the object size being too small. This caused bad things
to happen later when the file was mapped.

Reviewed by: John Dyson


8692 21-May-1995 dg

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this led to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


8624 19-May-1995 dg

NFS diskless operation was broken because swapdev_vp wasn't initialized.
These changes solve the problem in a general way by moving the
initialization out of the individual fs_mountroot's and into swaponvp().

Submitted by: Poul-Henning Kamp


8530 15-May-1995 dg

Fixed incompleteness that would allow dirty filesystems to get mounted
when the single user shell was terminated. These changes disallow mounting
or R/W upgrading filesystems that are dirty unless "-f" (force) option
is used with mount. /etc/rc has been modified to abort the startup if
one or more non-nfs partitions fail to mount.

Reviewed by: Poul-Henning Kamp, Rod Grimes


8529 15-May-1995 dg

From Bruce Evans:
I ran into another manifestation of the problem reported in PR 211 and
fixed it. Try this:

as non-root:
cd /tmp; mkdir x y x/z
as root:
chown root /tmp/x/z
as non-root:
cd /tmp/x; mv z ../y # EACCES as expected
as root:
cd /tmp/x; mv z ../y # EINVAL NOT as expected

This is because ufs_rename() sets IN_RENAME and fails to clear it.

Reviewed by: davidg
Submitted by: bde


8456 11-May-1995 rgrimes

Fix -Wformat warnings from LINT kernel.


8210 01-May-1995 dyson

Limit filesize to the amount that the VM system can currently handle
(2GB). If this limit is not imposed, then filesystem corruption will
ensue when files larger than 2GB are created. This is temporary,
and the underlying limitation will be removed later.


8054 25-Apr-1995 phk

Add a printf so we can see where we get our rootfs from.


8053 25-Apr-1995 dyson

Fixed the mmap hang fix previously committed so that it works
with options DIAGNOSTIC, and clear up an additional reference
count problem.


8041 24-Apr-1995 dyson

Changes to get rid of ufslk2 hangs when doing read/write to/from
mmap regions that are in the same file as the read/write.


7876 16-Apr-1995 dg

Make vegetarian and animal rights people happy and use 0xdeadc0de instead
of 0xdeadbeef as the 'spare' value.


7752 11-Apr-1995 dg

Handle the "syncing VCHR vnode hang" problem a little differently; just
don't lock the vnode - it doesn't appear to ever be necessary for VCHR
vnode/inodes. This fixes a bug introduced in the previous commit that
caused tty timestamps to act strange (causing 'w' and 'finger' to show
the tty wasn't idle when it may have been for hours).


7695 09-Apr-1995 dg

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previously not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


7430 28-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) that I didn't notice when I fixed
"all" such warnings before.


7399 26-Mar-1995 dg

Removed third arg (vmio) to allocbuf() that was added with the original
merged cache changes, and figure it out based on the B_VMIO buffer flag.
Fixes a problem where delayed write VMIO buffers would sometimes get
recopied into kernel-alloced memory.

Submitted by: John Dyson


7170 19-Mar-1995 dg

Removed redundant newlines that were in some panic strings.


7169 19-Mar-1995 dg

Backed out change to panic call: As Chris just pointed out to me, panic()
does indeed work like printf(). gdb gets the string untranslated for some
reason.


7156 19-Mar-1995 dg

Fix a call to panic: panic doesn't do token substitution on the panic
string.


7145 18-Mar-1995 dg

Don't sync the inode date changes of character special devices
during the FS sync. The system would appear to hang momentarily
if there was a large backlog of I/O. This is because the vnode
remains locked during the output - preventing normal character
I/O. The problem was exacerbated by the FFS contiguous block
allocation fixes and a semi-broken disksort(). The inode/date
will still be synced during a normal FS dismount and whenever
the inode is changed for other reasons.


7133 18-Mar-1995 dg

Woops, add back that #define...it's used later in the file.


7126 18-Mar-1995 dg

Fixed comments and removed b_cylinder #define.


7125 18-Mar-1995 dg

Integrated change from 1.1.5: Fixed broken disksort to sort by pblkno
rather than by cylinder.


7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


7018 12-Mar-1995 bde

Finish the previous change. The device name got lost in diskerr().


7006 11-Mar-1995 dg

Removed gratuitous and *extremely* evil setting of OBJ_INTERNAL. This
caused a cascade of problems including kernel memory corruption, file
corruption, system hangs, and panics.


6994 10-Mar-1995 dg

Increased default minfree to 8%.


6993 10-Mar-1995 dg

The threshold for switching from time-space and space-time is too small
when minfree is 5%...so make it stay at space in this case.

Submitted by: Kirk McKusick


6992 10-Mar-1995 dg

Patch to fix quota panic from Mike Karels:

allow Q_SYNC regardless of "target" uid, we allow it with -1;
fix bug that caused all ops to refer to user quotas, not group.

Submitted by: Mike Karels


6875 04-Mar-1995 dg

Removed obsolete vtrace() remnants.


6864 03-Mar-1995 dg

Fixes from John Dyson to work around vnode lock hang. Basically, remove
the VOP_BMAP calls, and add one to bdwrite.

Submitted by: John Dyson


6769 27-Feb-1995 se

Don't try to make use of useless rotational position optimisation,
if all free blocks are in the same bucket (i.e. NRPOS == 1).
Else a free block is chosen, possibly from a different cylinder,
even if the block succeeding bpref was free ...

Submitted by: se


6640 22-Feb-1995 bde

Use dsname() to get consistent names.


6505 16-Feb-1995 bde

Adjust slice names in diskerr() for the rearranged slice numbers. The
mapping from numbers to names is messy for backwards compatibility.
E.g., for driver "sd", unit "0":

slice 0: omit the slice number for compatibility; names are sd0[a-h].
slice 1: omit the partition letter 'c' because the whole disk device
shouldn't have anything to do with partitions; sd0 is the
only name.
slices 2-31: subtract 1 from slice number to compensate for the
compatibility slice 0; names are sd0s[1-30][a-h].
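
The mapping above can be sketched as a small formatting routine. This is an illustration of the naming rule only, under the assumption that partitions are numbered 0-7 for letters a-h; `sketch_slice_name` is a hypothetical helper and is not the code used by diskerr() or dsname().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a disk name from driver, unit, slice and
 * partition following the rules in the log entry above.
 * Returns the snprintf() result, or -1 for out-of-range arguments.
 */
static int
sketch_slice_name(char *buf, size_t len, const char *driver, int unit,
    int slice, int part)
{
	assert(buf != NULL && driver != NULL);

	if (slice < 0 || slice > 31 || part < 0 || part > 7)
		return (-1);
	if (slice == 0)		/* compatibility slice: no slice number */
		return (snprintf(buf, len, "%s%d%c", driver, unit, 'a' + part));
	if (slice == 1)		/* whole disk: no slice, no partition letter */
		return (snprintf(buf, len, "%s%d", driver, unit));
	/* slices 2-31: subtract 1 to compensate for the compatibility slice */
	return (snprintf(buf, len, "%s%ds%d%c", driver, unit, slice - 1,
	    'a' + part));
}
```

For driver "sd", unit 0, this sketch yields sd0a for slice 0 partition 0, sd0 for slice 1, and sd0s1a for slice 2 partition 0.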


6357 14-Feb-1995 phk

YF fix.


6151 03-Feb-1995 dg

Fixed bmap run-length brokenness.
Use bmap run-length extension when doing clustered paging.

Submitted by: John Dyson


5840 24-Jan-1995 dg

Removed some unused/obsolete code.

Submitted by: John Dyson


5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


5392 04-Jan-1995 gibbs

Change panic messages in ffs_blah functions to say they are ffs, not
ufs, functions.


5391 04-Jan-1995 gibbs

LFS stability patches. There is still a problem with directory update
ordering that can prove fatal during large batches of deletes, but this
is much better than it was. I probably won't be putting much more time
into this until Seltzer releases her new version of LFS which has
fragment support. This should be available just before USENIX.


5248 27-Dec-1994 bde

Use the same current time throughout ffs_update().

Update some macro names in comments.

Don't use MNT_WAIT for something not related to mounting.


5247 27-Dec-1994 bde

Use the same current time throughout ITIMES(). I want all current
timestamps for an atomic operation such as rename() on a local file
system to be identical.

Uniformize yet another idempotency ifdef. The comment nesting was
bogus.


5185 22-Dec-1994 bde

Print `slicename' and not a bogus pointer in diskerr()


5126 16-Dec-1994 bde

Duplicate readdisklabel() and writedisklabel() and remove DOS stuff from
the copies to create correct_readdisklabel() and
correct_writedisklabel().

Print the slice number in diskerr() if it is nonzero.


4827 26-Nov-1994 bde

Submitted by: Kirk McKusick

Allow chown() to return success if the gid isn't changed even if
the gid is not the caller's. Such gids are normal for files created
in world-writable directories such as /tmp. This "fixes" annoying
error messages for mv'ing files created in /tmp to another file
system. mv still preserves the foreign gid of /tmp, but now does
it silently.


4535 17-Nov-1994 gibbs

John Dyson's patches (and a few from me too) to LFS to use a different
buffering scheme and make it more in tune with FreeBSD's vfs_bio
implementation. The filesystem seems fairly stable, but I wouldn't recommend
it to anyone not willing to experience problems. This is very green code and
has the limitation that YOU CAN ONLY HAVE ONE LFS PARTITION MOUNTED AT A TIME.

What LFS is good for:

Non-fsynced writes: FASTER THAN FFS
Large deletions: incredibly fast

Reads are a little bit slower than FFS right now, but that is a factor of
how under optimized this code is. LFS should in theory perform at least as
well as FFS under fsync (iozone) type loads, and this is what I'm currently
working on.

Reviewed by: Justin Gibbs
Submitted by: John Dyson


4464 14-Nov-1994 bde

Remove unused `struct disklabel' (the declarations that used it went away).

Uniformize idempotency ifdef.


4463 14-Nov-1994 bde

Undo a previous change. <sys/disklabel.h> was broken, not these files.


3962 28-Oct-1994 jkh

From: fredriks@mcs.com (Lars Fredriksen)
...
It turns out that these files do not include <sys/dkbad.h> before
<sys/disklabel.h>.
Submitted by: fredriks


3940 27-Oct-1994 jkh

Julian Elischer's disklabel fixes.


3768 22-Oct-1994 dg

Restrict fs_maxfilesize to 2^40, and check against this in ffs_truncate().
This is part of a bug fix from Kirk McKusick to work around problems in FFS
related to the blkno of a 64bit offset not fitting into an int. Note the
proper solution would be to deal with 64bit block numbers, but doing this
would require sweeping changes; some other day perhaps.

Submitted by: Marshall Kirk McKusick
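
A rough sketch of the kind of bound check r3768 describes, together with the arithmetic behind it. The function names, the errno values returned, and the 512-byte block size in the usage note are assumptions made for illustration; only the 2^40 limit comes from the log entry.

```c
#include <errno.h>
#include <stdint.h>

/* Limit named in the log entry: 2^40 bytes. */
#define SKETCH_MAXFILESIZE ((int64_t)1 << 40)

/*
 * Hypothetical stand-in for a length check such as the one added to
 * ffs_truncate(). Returns 0 if the length is acceptable, otherwise an
 * errno value (the specific values are an assumption of this sketch).
 */
static int
sketch_check_length(int64_t length)
{
	if (length < 0)
		return (EINVAL);
	if (length > SKETCH_MAXFILESIZE)
		return (EFBIG);
	return (0);
}

/* Block number of a byte offset, for a block size of (1 << shift) bytes. */
static int64_t
sketch_offset_to_blkno(int64_t offset, int shift)
{
	return (offset >> shift);
}
```

As a worked example, with 512-byte blocks (shift 9) the largest offset below 2^40 maps to block number 2^31 - 1, which is the largest value a signed 32-bit int can hold; any larger offset would overflow it.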


3745 21-Oct-1994 wollman

Make my ALLDEVS kernel compile (basically, LINT minus a lot of options).

This involves fixing a few things I broke last time.


3653 17-Oct-1994 phk

This basically allows you to stick a disklabel on any partition.

For it to be useful, you must stick your disklabel on the partition which
starts where the MBR says FreeBSD lives. If you don't do that, you might
get a bad day.

Oh, that probably also means that putting swap there is a bad idea...


3605 15-Oct-1994 ache

Add back variable declaration removed by wrong previous cleanups


3604 15-Oct-1994 ache

Add back variable declaration removed by wrong previous cleanups.


3487 10-Oct-1994 phk

Cosmetics. make gcc less noisy. Still some way to go here.


3451 09-Oct-1994 dg

Got rid of map.h. It's a leftover from the rmap code, and we use rlists.
Changed swapmap into swaplist.


3427 08-Oct-1994 phk

POSSIBLE BOGUS CODE found, (related to dos-partitions) in ufs_disksubr.c,
look for CC_WALL.
Cosmetics, a couple of unused vars.


3425 08-Oct-1994 phk

Cosmetics for gcc -Wall. A couple of unused "int i"'s removed and a couple of
prototypes added. And the usual () work.


3420 08-Oct-1994 phk

Cosmetics.


3396 06-Oct-1994 dg

Use tsleep() rather than sleep so that 'ps' is more informative about
the wait.


3167 28-Sep-1994 dfr

Make NFS ask the filesystems for directory cookies instead of making them
itself.


3148 27-Sep-1994 phk

Moved the "relookup" routine into vfs_lookup.c from ufs/ufs/ufs_vnops.c.
Several FS's use this, so it doesn't belong in ufs. (unionfs, msdosfs and ufs)


3103 25-Sep-1994 dg

Removed unimplemented subr_rmap.c and unused references to it.


2979 22-Sep-1994 wollman

More loadable VFS changes:

- Make a number of filesystems work again when they are statically compiled
(blush)

- FIFOs are no longer optional; ``options FIFO'' removed from distributed
config files.


2967 22-Sep-1994 wollman

Call ffs ``ufs'' for the benefit of poor, confused user-land programs.


2946 21-Sep-1994 wollman

Implemented loadable VFS modules, and made most existing filesystems
loadable. (NFS is a notable exception.)


2922 20-Sep-1994 bde

Use `1' for a boolean value instead of something irrelevant (MNT_WAIT)
that happens to be nonzero.


2689 12-Sep-1994 dg

Eliminated a whole pile of ancient (we're talking 4.3BSD) VM system
related #define constants. Corrected incorrect VM_MAX_KERNEL_ADDRESS.

Reviewed by: John Dyson


2460 02-Sep-1994 dg

panic if length is < 0 in ffs_truncate().


2384 29-Aug-1994 dg

"bogus" fixes from 1.1.5 to work around some cache coherency problems.


2177 21-Aug-1994 paul

Made idempotent


2176 21-Aug-1994 paul

Made idempotent


2152 20-Aug-1994 dg

Implemented filesystem clean bit via:

machdep.c:
Changed printf's a little and call vfs_unmountall() if the sync was
successful.

cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c:
Allow dismount of root FS. It is now disallowed at a higher level.

vfs_conf.c:
Removed unused rootfs global.

vfs_subr.c:
Added new routines vfs_unmountall and vfs_unmountroot. Filesystems
are now dismounted if the machine is properly rebooted.

ffs_vfsops.c:
Toggle clean bit at the appropriate places. Print warning if an
unclean FS is mounted.

ffs_vfsops.c, lfs_vfsops.c:
Fix bug in selecting proper flags for VOP_CLOSE().

vfs_syscalls.c:
Disallow dismounting root FS via umount syscall.


2142 20-Aug-1994 dg

1) cleaned up after Garrett - fixed more redundant declarations, changed
use of timeout_t -> timeout_func_t in aha1542 and aha1742 drivers.
2) fix a bug in the portalfs that was uncovered by better prototyping -
specifically, the time must be converted from timeval to timespec
before storing in va_atime.
3) fixed/added some miscellaneous prototypes


2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


1960 08-Aug-1994 dg

Made lockf advisory locking code generic (rather than ufs specific), and
use it in NFS. This is required both for diskless support and for POSIX
compliance. Note: the support in NFS is only for the local node.

Submitted by: based on work originally done by Yuval Yurom


1937 08-Aug-1994 dg

Changed B_AGE policy to work correctly in a world with relatively large
buffer caches. The old policy generally ended up caching nothing.


1826 03-Aug-1994 dg

Changed occurrences of "itrunc" to "ffs_truncate" to make Bruce happy.


1821 02-Aug-1994 dg

Completed (hopefully) the kernel support for old style "fastlinks".


1817 02-Aug-1994 dg

Added $Id$


1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.