History log of /freebsd-10.2-release/sys/ufs/ffs/ffs_softdep.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 285830 23-Jul-2015 gjb

- Copy stable/10@285827 to releng/10.2 in preparation for 10.2-RC1
builds.
- Update newvers.sh to reflect RC1.
- Update __FreeBSD_version to reflect 10.2.
- Update default pkg(8) configuration to use the quarterly branch.[1]

Discussed with: re, portmgr [1]
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 284375 14-Jun-2015 kib

MFC r283832:
Remove unused variable.


# 284200 10-Jun-2015 kib

MFC r283604:
Remove NODELAY flag.


# 284199 10-Jun-2015 kib

MFC r283600:
Perform SU cleanup in the AST handler. Do not sleep waiting for SU cleanup
while owning vnode lock.

On MFC, for KBI stability, td_su member was moved to the end of the
struct thread.


# 284021 05-Jun-2015 kib

MFC r283735:
Remove several write-only variables.


# 281350 10-Apr-2015 kib

MFC r280760:
Fix the hand after the immediate reboot after the init binary is unlinked.

MFC r280763:
Fix build (with gcc).


# 278667 13-Feb-2015 kib

MFC r277922:
When mounting SU-enabled mount point, wait until the softdep_flush()
thread started and incremented the stat_flush_threads.

MFC r278257:
Partially revert r277922.


# 274305 09-Nov-2014 kib

MFC r273967:

Only trigger a panic when forced operation is done. Convert direct
panic() call into KASSERT().


# 270157 18-Aug-2014 mckusick

MFC of 269533 (by mckusick):

Add support for multi-threading of soft updates.

Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.

Reviewed by: kib
Tested by: Peter Holm and Scott Long
MFC after: 2 weeks
Sponsored by: Netflix

MFC of 269853 (by kib):

Revision r269457 removed the Giant around mount and unmount code, but
r269533, which was tested before r269457 was committed, implicitely
relied on the Giant to protect the manipulations of the softdepmounts
list. Use softdep global lock consistently to guarantee the list
structure now.

Insert the new struct mount_softdeps into the softdepmounts only after
it is sufficiently initialized, to prevent softdep_speedup() from
accessing bare memory. Similarly, remove struct mount_softdeps for
the unmounted filesystem from the tailq before destroying structure
rwlock.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation


# 270007 14-Aug-2014 mckusick

MFC of 269674:

The journal is only prepared to handle full-size block numbers, so we
have to adjust freeblk records to reflect the change to a full-size block.
For example, suppose we have a block made up of fragments 8-15 and
want to free its last two fragments. We are given a request that says:
FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0
where frags are the number of frags to free and oldfrags are the number
of fragments to keep. To block align it, we have to change it to have a
valid full-size blkno, so it becomes:
FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6

Submitted by: Mikihito Takehara
Tested by: Mikihito Takehara
Reviewed by: Jeff Roberson


# 268077 01-Jul-2014 scottl

Merge r265463:

Due to reasons unknown at this time, the system can be forced to write
a journal block even when there are no journal entries to be written.
Until the root cause is found, handle this case by ensuring that a
valid journal segment is always written.

Second, the data buffer used for writing journal entries was never
being scrubbed of old data. Fix this.

Submitted by: Takehara Mikihito
Obtained from: Netflix, Inc.


# 262779 05-Mar-2014 pfg

MFC r262678;
ufs: small formatting fixes.

Cleanup some extra space.
Use of tabs vs. spaces.
No functional change.

Reviewed by: mckusick


# 260078 30-Dec-2013 mckusick

MFC of 256801, 256803, 256808, 256812, 256817, 256845, and 256860.
This set of changes puts in place the infrastructure to allow soft
updates to be multi-threaded. It introduces no functional changes
from its current operation.

MFC of 256860:

Allow kernels without options SOFTUPDATES to build. This should fix the
embedded tinderboxes.

Reviewed by: emaste

MFC of 256845:

Fix build problem on ARM (which defaults to building without soft updates).

Reported by: Tinderbox
Sponsored by: Netflix

MFC of 256817:

Restructuring of the soft updates code to set it up so that the
single kernel-wide soft update lock can be replaced with a
per-filesystem soft-updates lock. This per-filesystem lock will
allow each filesystem to have its own soft-updates flushing thread
rather than being limited to a single soft-updates flushing thread
for the entire kernel.

Move soft update variables out of the ufsmount structure and into
their own mount_softdeps structure referenced by ufsmount field
um_softdep. Eventually the per-filesystem lock will be in this
structure. For now there is simply a pointer to the kernel-wide
soft updates lock.

Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock
pointer in the mount_softdeps structure instead of a pointer to the
kernel-wide soft-updates lock.

Replace the five hash tables used by soft updates with per-filesystem
copies of these tables allocated in the mount_softdeps structure.

Several functions that flush dependencies when too many are allocated
in the kernel used to operate across all filesystems. They are now
parameterized to flush dependencies from a specified filesystem.
For now, we stick with the round-robin flushing strategy when the
kernel as a whole has too many dependencies allocated.

While there are many lines of changes, there should be no functional
change in the operation of soft updates.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix

MFC of 256812:

Fourth of several cleanups to soft dependency implementation.
Add KASSERTS that soft dependency functions only get called
for filesystems running with soft dependencies. Calling these
functions when soft updates are not compiled into the system
become panic's.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix

MFC of 256808:

Third of several cleanups to soft dependency implementation.
Ensure that softdep_unmount() and softdep_setup_sbupdate()
only get called for filesystems running with soft dependencies.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix

MFC of 256803:

Second of several cleanups to soft dependency implementation.
Delete two unused functions in ffs_sofdep.c.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix

MFC of 256801:

First of several cleanups to soft dependency implementation.
Convert three functions exported from ffs_softdep.c to static
functions as they are not used outside of ffs_softdep.c.

No functional change.

Tested by: Peter Holm and Scott Long
Sponsored by: Netflix


# 260033 29-Dec-2013 mckusick

MFC of 258789:

We needlessly panic when trying to flush MKDIR_PARENT dependencies.
We had previously tried to flush all MKDIR_PARENT dependencies (and
all the NEWBLOCK pagedeps) by calling ffs_update(). However this will
only resolve these dependencies in direct blocks. So very large
directories with MKDIR_PARENT dependencies in indirect blocks had
not yet gotten flushed. As the directory is in the midst of doing a
complete sync, we simply defer the checking of the MKDIR_PARENT
dependencies until the indirect blocks have been sync'ed.

Reported by: Shawn Wallbridge of imaginaryforces.com
Tested by: John-Mark Gurney <jmg@funkthat.com>
PR: 183424


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 253974 05-Aug-2013 mckusick

With the addition of journalled soft updates, the "newblk" structures
persist much longer than previously. Historically we had at most 100
entries; now the count may reach a million. With the increased count
we spent far too much time looking them up in the grossly undersized
newblk hash table. Configure the newblk hash table to accurately reflect
the number of entries that it must index.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 253973 05-Aug-2013 mckusick

To better understand performance problems with journalled soft updates,
we need to collect the highest level of allocation for each of the
different soft update dependency structures. This change collects these
statistics and makes them available using `sysctl debug.softdep.highuse'.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 251171 31-May-2013 jeff

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


# 250901 22-May-2013 mckusick

Properly spell sentinel (missed in 250891)
No functional changes.

Spotted by: Navdeep Parhar and Alexey Dokuchaev
MFC after: 2 weeks


# 250897 22-May-2013 mckusick

Add missing buffer releases (brelse) after bread calls that return
an error. One could argue that returning a buffer even when it is
not valid is incorrect, but bread has always returned a buffer
valid or not.

Reviewed by: kib
MFC after: 2 weeks


# 250895 22-May-2013 mckusick

Add missing 28th element to softdep types name array.

Found by: Coverity Scan, CID 1007621
Reviewed by: kib
MFC after: 2 weeks


# 250894 22-May-2013 mckusick

Null a pointer after it is freed so that when it is returned
the return value is NULL. Based on the returned flags, the
return value should never be inspected in the case where NULL
is returned, but it is good coding practice not to return a
pointer to freed memory.

Found by: Coverity Scan, CID 1006096
Reviewed by: kib
MFC after: 2 weeks


# 250892 22-May-2013 mckusick

Remove a bogus check for a NULL buffer pointer.
Add a KASSERT that it is not NULL.

Found by: Coverity Scan, CID 1009114
Reviewed by: kib
MFC after: 2 weeks


# 250891 22-May-2013 mckusick

Properly spell sentinel (not sintenel or sentinal).
No functional changes.

Spotted by: kib
MFC after: 2 weeks


# 249218 06-Apr-2013 jeff

Prepare to replace the buf splay with a trie:

- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division


# 249064 03-Apr-2013 mckusick

The code in clear_remove() and clear_inodedeps() skips one entry
in the pagedep and inodedep hash tables. An entry in the table is
skipped because 'pagedep_hash' and 'inodedep_hash' hold the size
of the hash tables - 1.

The chance that this would have any operational failure is extremely
unlikely. These funtions only need to find a single entry and are
only called when there are too many entries. The chance that they
would fail because all the entries are on the single skipped hash
chain are remote.

Submitted by: Pedro Martelletto
Reviewed by: kib
MFC after: 2 weeks


# 247388 27-Feb-2013 kib

The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with with non-zero refcount is
encountered. The situation is possible, because quotas are turned off
before softdep workitem queue if flushed, due to the quota file writes
might create softdep workitems.

Make the encountering an active dquot in dqflush() not fatal, return
the error from quotaoff() instead. Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of softdep_flushfiles()
loop, until the last iteration. At the last loop, the quotas must be
closed, and because SU workitems should be already flushed, the
references to dquot are gone.

Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Reviewed by: mckusick
MFC after: 2 weeks


# 245286 11-Jan-2013 kib

Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by: The FreeBSD Foundation


# 244534 21-Dec-2012 attilio

Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week


# 243018 14-Nov-2012 jeff

- Fix a truncation bug with softdep journaling that could leak blocks on
crash. When truncating a file that never made it to disk we use the
canceled allocation dependencies to hold the journal records until
the truncation completes. Previously allocdirect dependencies on
the id_bufwait list were not considered and their journal space
could expire before the bitmaps were written. Cancel them and attach
them to the freeblks as we do for other allocdirects.
- Add KTR traces that were used to debug this problem.
- When adding jsegdeps, always use jwork_insert() so we don't have more
than one segdep on a given jwork list.

Sponsored by: EMC / Isilon Storage Division


# 242924 12-Nov-2012 jeff

- Fix a bug that has existed since the original softdep implementation.
When a background copy of a cg is written we complete any work associated
with that bmsafemap. If new work has been added to the non-background
copy of the buffer it will be completed before the next write happens.
The solution is to do the rollbacks when we make the copy so only those
dependencies that were present at the time of writing will be completed
when the background write completes. This would've resulted in various
bitmap related corruptions and panics. It also would've expired journal
entries early causing journal replay to miss some records.

MFC after: 2 weeks


# 242815 09-Nov-2012 jeff

- Correct rev 242734, segments can sometimes get stuck. Be a bit more
defensive with segment state.

Reported by: b. f. <bf1783@googlemail.com>


# 242734 08-Nov-2012 jeff

- Implement BIO_FLUSH support around journal entries. This will not 100%
solve power loss problems with dishonest write caches. However, it
should improve the situation and force a full fsck when it is unable
to resolve with the journal.
- Resolve a case where the journal could wrap in an unsafe way causing
us to prematurely lose journal entries in very specific scenarios.

Discussed with: mckusick
MFC after: 1 month


# 242492 02-Nov-2012 jeff

- In cancel_mkdir_dotdot don't panic if the inodedep is not available. If
the previous diradd had already finished it could have been reclaimed
already. This would only happen under heavy dependency pressure.

Reported by: Andrey Zonov <zont@FreeBSD.org>
Discussed with: mckusick
MFC after: 1 week


# 242259 28-Oct-2012 trasz

Fix two problems that caused instant panic when the device mounted
with softupdates went away. Note that this does not fix the problem
entirely; I'm committing it now to make it easier for someone to pick
up the work.

Reviewed by: mckusick


# 241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 241011 27-Sep-2012 mdf

Fix up kernel sources to be ready for a 64-bit ino_t.

Original code by: Gleb Kurtsou


# 236937 11-Jun-2012 mckusick

In softdep_setup_inomapdep() we may have to allocate both inodedep
and bmsafemap dependency structures in inodedep_lookup() and
bmsafemap_lookup() respectively. The setup of these structures must
be done while holding the soft-dependency mutex. If the inodedep is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the bmsafemap. If the bmsafemap is
allocated first, it may be freed in the I/O completion callback when
the mutex is released to allocate the inodedep.

To resolve this problem, bmsafemap_lookup has had a parameter added
that allows a pre-malloc'ed bmsafemap to be passed in so that it does
not need to release the mutex to create a new bmsafemap. The
softdep_setup_inomapdep() routine pre-malloc's a bmsafemap dependency
before acquiring the mutex and starting to build the inodedep with a
call to inodedep_lookup(). The subsequent call to bmsafemap_lookup()
is passed this pre-allocated bmsafemap entry so that it need not
release the mutex if it needs to create a new one.

Reported by: Peter Holm
Tested by: Peter Holm
MFC after: 1 week


# 235610 18-May-2012 mckusick

Add missing `continue' statement at end of case.

Found by: Kevin Lo (kevlo@)
MFC after: 1 week


# 234608 23-Apr-2012 trasz

Remove unused thread argument from clear_inodeps() and clear_remove().


# 234386 17-Apr-2012 mckusick

Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 233817 02-Apr-2012 mckusick

A file cannot be deallocated until its last name has been removed
and it is no longer referenced by a user process. The inode for a
file whose name has been removed, but is still referenced at the
time of a crash will still be allocated in the filesystem, but will
have no references (e.g., they will have no names referencing them
from any directory).

With traditional soft updates these unreferenced inodes will be
found and reclaimed when the background fsck is run. When using
journaled soft updates, the kernel must keep track of these inodes
so that it can find and reclaim them during the cleanup process.
Their existence cannot be stored in the journal as the journal only
handles short-term events, and they may persist for days. So, they
are tracked by keeping them in a linked list whose head pointer is
stored in the superblock. The journal tracks them only until their
linked list pointers have been commited to disk. Part of the cleanup
process involves traversing the list of unreferenced inodes and
reclaiming them.

This bug was triggered when confusion arose in the commit steps
of keeping the unreferenced-inode linked list coherent on disk.
Notably, a race between the link() system call adding a link-count
to a file and the unlink() system call removing a link-count to
the file. Here if the unlink() ran after link() had looked up
the file but before link() had incremented the link-count of the
file, the file's link-count would drop to zero before the link()
incremented it back up to one. If the file was referenced by a
user process, the first transition through zero made it appear
that it should be added to the unreferenced-inode list when in
fact it should not have been added. If the new name created by
link() was deleted within a few seconds (with the file still
referenced by a user process) it would legitimately be a candidate
for addition to the unreferenced-inode list. The result was that
there were two attempts to add the same inode to the unreferenced-inode
list which scrambled the unreferenced-inode list's pointers leading
to a panic. The fix is to detect and avoid the false attempt at
adding it to the unreferenced-inode list by having the link()
system call check to see if the link count is zero before it
increments it. If it is, the link() fails with ENOENT (showing that
it has failed the link()/unlink() race).

While tracking down this bug, we have added additional assertions
to detect the problem sooner and also simplified some of the code.

Reported by: Kirk Russell
Fix submitted by: Jeff Roberson
Tested by: Peter Holm
PR: kern/159971
MFC (to 9 only): 2 weeks


# 233438 25-Mar-2012 mckusick

Add a third flags argument to ffs_syncvnode to avoid a possible conflict
with MNT_WAIT flags that passed in its second argument. This will be
MFC'ed together with r232351.

Discussed with: kib


# 232948 13-Mar-2012 kib

Supply boolean as the second argument to ffs_update(), and not a
MNT_[NO]WAIT constants, which in fact always caused sync operation.

Based on the submission by: bde
Reviewed by: mckusick
MFC after: 2 weeks


# 232709 09-Mar-2012 kib

Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after: 1 week


# 232351 01-Mar-2012 mckusick

This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks


# 231572 13-Feb-2012 mckusick

Missing conditions in checking whether an inode has been written.

Found and tested by: Peter Holm
MFC after: 2 weeks (to 9 only)


# 231091 06-Feb-2012 kib

Add missing opt_quota.h include to activate #ifdef QUOTA blocks,
apparently a step in unbreaking QUOTA support.

Reported and tested by: Adam Strohl <adams-freebsd ateamsystems com>
MFC after: 1 week


# 231077 06-Feb-2012 kib

JNEWBLK dependency may legitimately appear on the buf dependency
list. If softdep_sync_buf() discovers such dependency, it should do
nothing, which is safe as it is only waiting on the parent buffer to
be written, so it can be removed.

Committed on behalf of: jeff
MFC after: 1 week


# 227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 225700 20-Sep-2011 kib

Use nowait sync request for a vnode when doing softdep cleanup. We possibly
own the unrelated vnode lock, doing waiting sync causes deadlocks.

Reported and tested by: pho
Approved by: re (bz)


# 225166 25-Aug-2011 mm

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


# 224503 30-Jul-2011 mckusick

Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP
is set so that mount can revert back to using MNT_NOWAIT when doing
getmntinfo.

Approved by: re (kib)


# 224294 24-Jul-2011 mckusick

Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)


# 224027 14-Jul-2011 mckusick

Consistently check mount flag (MNTK_SUJ) rather than superblock
flag (FS_SUJ) when determining whether to do journaling-based
operations. The mount flag is set only when journaling is active
while the superblock flag is set to indicate that journaling is to
be used. For example, when the filesystem is mounted read-only, the
journaling may be present (FS_SUJ) but not active (MNTK_SUJ).
Inappropriate checking of the FS_SUJ flag was causing some
journaling actions to be attempted at inappropriate times.


# 223772 04-Jul-2011 jeff

- Speed up pendingblock processing again. Having too much delay between
ffs_blkfree() and the pending adjustment causes all kinds of
space related problems.


# 223771 04-Jul-2011 jeff

- Handle D_JSEGDEP in the softdep_sync_buf() switch. These can now
find themselves on snapshot vnodes.

Reported by: pho


# 223770 04-Jul-2011 jeff

- It is impossible to run request_cleanup() while doing a copyonwrite.
This will most likely cause new block allocations which can recurse
into request cleanup.
- While here optimize the ufs locking slightly. We need only acquire and
drop once.
- process_removes() and process_truncates() also is only needed once.
- Attempt to flush each item on the worklist once but do not loop forever
if some can not be completed.

Discussed with: mckusick


# 223687 29-Jun-2011 mckusick

Handle the FREEDEP case in softdep_sync_buf().
This fix failed to get added in -r223325.

Submitted by: Peter Holm


# 223325 20-Jun-2011 jeff

- Fix directory count rollbacks by passing the mode to the journal dep
earlier.
- Add rollback/forward code for frag and cluster accounting.
- Handle the FREEDEP case in softdep_sync_buf(). (submitted by pho)


# 223127 15-Jun-2011 mckusick

Ensure that filesystem metadata contained within persistent snapshots
is always kept consistent.

Suggested by: Jeff Roberson


# 223105 15-Jun-2011 mckusick

Missing cleanup case after completion of a snapshot vnode write
claiming a released block.

Submitted by: Jeff Roberson
Tested by: Peter Holm


# 223020 12-Jun-2011 mckusick

Update to soft updates journaling to properly track freed blocks
that get claimed by snapshots.

Submitted by: Jeff Roberson
Tested by: Peter Holm


# 223018 12-Jun-2011 mckusick

Disable the soft updates journaling after a filesystem is successfully
downgraded to read-only. It will be restarted if the filesystem is
upgraded back to read-write.


# 222958 10-Jun-2011 jeff

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


# 221829 13-May-2011 mdf

Use a name instead of a magic number for kern_yield(9) when the priority
should not change. Fetch the td_user_pri under the thread lock. This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >


# 220532 11-Apr-2011 jeff

- Refactor softdep_setup_freeblocks() into a set of functions to prepare
for a new journal specific partial truncate routine.
- Use dep_current[] in place of specific dependency counts. This is
automatically maintained when workitems are allocated and has
less risk of becoming incorrect.


# 220511 10-Apr-2011 jeff

Fix a long standing SUJ performance problem:

- Keep a hash of indirect blocks that have recently been freed and are
still referenced in the journal.
- Lookup blocks in this hash before forcing a new block write to wait on
the journal entry to hit the disk. This is only necessary to avoid
confusion between old identities as indirects and new identities as
file blocks.
- Don't free jseg structures until the journal has written a record that
invalidates it. This keeps the indirect block information around for
as long as is required to be safe.
- Force an empty journal block write when required to flush out stale
journal data that is simply waiting for the oldest valid sequence
number to advance beyond it.


# 220406 07-Apr-2011 jeff

- Don't invalidate jnewblks immediately upon discovering that the block
will be removed. Permit the journal to proceed so that we don't leave
a rollback in a cg for a very long time as this can cause terrible perf
problems in low memory situations.

Tested by: pho


# 220374 05-Apr-2011 mckusick

Be far more persistent in reclaiming blocks and inodes before giving
up and declaring a filesystem out of space. Especially necessary when
running on a small filesystem. With this improvement, it should be
possible to use soft updates on a small root filesystem.

Kudos to: Peter Holm
Testing by: Peter Holm
MFC: 2 weeks


# 220282 02-Apr-2011 jeff

Fix problems that manifested from filesystem full conditions:

- In softdep_revert_mkdir() find the dotaddref before we attempt to cancel
the jaddref so we can make assumptions about where the dotaddref is on
the list. cancel_jaddref() does not always remove items from the list
anymore.
- Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE
was never set. This ensures that dependencies will continue to be
processed on the inowait/bufwait list and is more an artifact of
the structure of the code than a pure ordering problem.
- Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed
appropriately. This normally occurs when the refs are added to the
journal but if they are canceled before this point the state would
never be set and the dependency could never be freed.

Reported by: pho
Tested by: pho


# 220099 28-Mar-2011 kib

Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case.

Submitted by: Aleksandr Rybalko <ray dlink ua>


# 219895 23-Mar-2011 mckusick

Add retry code analogous to the block allocation retry code
to avoid running out of inodes.

Reported by: Peter Holm


# 218602 12-Feb-2011 kib

Use the native sector size of the device backing the UFS volume for SU+J
journal blocks, instead of hard coding 512 byte sector size. Journal need
to atomically write the block, that can only be guaranteed at the device
sector size, not larger. Attempt to write less then sector size results in
driver errors.

Note that this is the first structure in UFS that depends on the
sector size. Other elements are written in the units of fragments.

In collaboration with: pho
Reviewed by: jeff
Tested by: bz, pho


# 218485 09-Feb-2011 netchild

Add some FEATURE macros for some UFS features.

SU+J is not included as a FEATURE macro:
- it was not in the tree during the GSoC
- I do not see an option to en-/disable it in NOTES

Two minor changes where made during the review compared to what was developed
during GSoC 2010.

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.

Sponsored by: Google Summer of Code 2010
Submitted by: kibab
Reviewed by: kib
X-MFC after: to be determined in last commit with code from this project


# 218424 08-Feb-2011 mdf

Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yeild() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.


# 218195 02-Feb-2011 mdf

Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week


# 217326 12-Jan-2011 mdf

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the kernel changes.


# 216951 04-Jan-2011 kib

Instead of incrementing freework reference counter in indir_trunc(), do
it at the allocation time for journaled fs and indirect blocks, when
the allocated object is not accessible outside.

Requested and reviewed by: jeff
Tested by: pho


# 216818 30-Dec-2010 kib

Handle missing jremrefs when a directory is renamed overtop of
another, deleting it. If the directory is removed, UFS always need to
remove the .. ref, even if the ultimate ref on the parent would not
change. The new directory must have a new journal entry for that ref.
Otherwise journal processing would not properly account for the
parent's reference since it will belong to a removed directory entry.

Change ufs_rename()'s dotdot rename section to always
setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency
allocated via setup_dotdot_link().

Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove().
Remove the isdirrem > 1 checks from newdirrem().

Reported by: many
Submitted by: jeff
Tested by: pho


# 216817 30-Dec-2010 kib

In indir_trunc(), when processing jnewblk entries that are not written
to the disk, recurse to handle indirect blocks of next level that are
hidden by the corresponding entry.

In collaboration with: pho
Reviewed by: jeff, mckusick
Tested by: mckusick, pho


# 216795 29-Dec-2010 kib

Move the definition of mkdirlisthd from header to C file.

Reviewed by: mckusick
Tested by: pho


# 216676 23-Dec-2010 mckusick

This patch fixes a soft update panic while running perl 5.12 tests
which produced:

panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164

Reported by: Dimitry Andric
Fixed by: Jeff Roberson


# 215950 27-Nov-2010 pho

First step in fixing the handle_workitem_freeblocks panic.

In collaboration with: kib


# 215117 11-Nov-2010 kib

The softdep_setup_freeblocks() adds worklist items before
deallocate_dependencies() is done. This opens a race between softdep
thread and the thread that does the truncation:
A write of the indirect block causes the freeblks to become
ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And
then, softdep_disk_write_complete() would reassign the workitem to the
mount point worklist, causing premature processing of the workitem, or
journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does
the same reassign.
indir_trunc() then would find the indirect block that is locked (with lock
owned by kernel) but without any dependencies, causing it to hang in
getblk() waiting for buffer lock.

Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies()
finished.

Analyzed, suggested and reviewed by: jeff
Tested by: pho


# 215115 11-Nov-2010 kib

Change #ifdef INVARIANTS panic into KASSERT, and print some useful
information to diagnose the issue, in handle_complete_freeblocks().

Reviewed by: jeff
Tested by: pho


# 215114 11-Nov-2010 kib

In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped.
I believe there is a window otherwise where jblocks can be accessed
without proper initialization.

Reviewed by: jeff
Tested by: pho


# 215113 11-Nov-2010 kib

Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by: jeff
Tested by: pho


# 215112 11-Nov-2010 kib

Fix typo. Function is called ffs_blkfree.


# 213363 02-Oct-2010 alc

M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that
have no run-time effect.


# 213275 29-Sep-2010 mckusick

Since local variable 'i' is used only in a KASSERT, declare and
initialize it only if INVARIANTS is defined to avoid a declared
but unused warning.

Suggested by: Brian Somers <brian@FreeBSD.org>


# 213259 29-Sep-2010 kib

Fix typo in comment.


# 212617 14-Sep-2010 mckusick

Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by: jeff Roberson


# 211531 20-Aug-2010 jhb

Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by: kib
MFC after: 3 days


# 211212 12-Aug-2010 kib

Softdep_process_worklist() should unsuspend not only before processing
the worklist (in softdep_process_journal), but also after flushing the
workitems. Might be, we should even do this before bwillwrite() too, but
this seems to be not needed for now.

Fs might be suspended during processing the queue, and then there is
nobody around to unsuspend.

In collaboration with: pho
Tested by: bz
Reviewed by: jeff


# 209717 06-Jul-2010 jeff

- Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count. Previously
all truncation as a result of unlink happened in the softdep flush
thread. This had the effect of being impossible to rate limit properly
with the journal code. Now the process issuing unlinks is suspended
when the journal files. This has a side-effect of improving rm
performance by allowing more concurrent work.
- Handle two cases in inactive, one for effnlink == 0 and another when
nlink finally reaches 0.
- Eliminate the SPACECOUNTED related code since the truncation is no
longer delayed.

Discussed with: mckusick


# 209057 11-Jun-2010 avg

ffs_softdep: change K&R in function defintions to ANSI prototypes

Apparently it's bad when we first have an ANSI prototype in function
declaration, but then use K&R in its defintion.

Complaint from: clang
MFC after: 2 weeks


# 208287 19-May-2010 jeff

- Don't immediately re-run softdepflush if we didn't make any progress
on the last iteration. This can lead to a deadlock when we have
worklist items that cannot be immediately satisfied.

Reported by: uqs, Dimitry Andric <dimitry@andric.com>

- Remove some unnecessary debugging code and place some other under
SUJ_DEBUG.
- Examine the journal state in softdep_slowdown().
- Re-format some comments so I may more easily add flag descriptions.


# 207742 07-May-2010 jeff

- Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
- Don't fsync() vnodes in prealloc if copy on write is in progress. It
is not safe to recurse back into the write path here.

Reported by: Vladimir Grebenschikov <vova@fbsd.ru>


# 207741 07-May-2010 jeff

- Use the correct flag mask when determining whether an inode has
successfully made it to the free list yet or not. This fixes
a deadlock that can occur with unlinked but referenced files.
Journal space and inodedeps were not correctly reclaimed because
the inode block was not left dirty.

Tested/Reported by: lwindschuh@googlemail.com


# 207310 28-Apr-2010 jeff

- When canceling jaddrefs they may not yet be in the journal if this is via
a revert call. In this case don't attempt to remove something that
has not yet been added. Otherwise this jaddref must hang around
to prevent the bitmap write as normal.


# 207309 28-Apr-2010 jeff

- Fix builds without SOFTUPDATES defined in the kernel config.


# 207142 24-Apr-2010 pjd

Fix build for UFS without SOFTUPDATES.


# 207141 24-Apr-2010 jeff

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# 196888 06-Sep-2009 kib

The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Noted and reviewed by: tegge
Tested by: pho
MFC after: 1 week


# 196206 14-Aug-2009 kib

When a UFS node is truncated to the zero length, e.g. by explicit
truncate(2) call, or by being removed or truncated on open, either
new softupdate freeblks structure is allocated to track the freed
blocks of the node, or truncation is done syncronously when too many SU
dependencies are accumulated. The decision does not take into account
the allocated freeblks dependencies, allowing workloads that do huge
amount of truncations to exhaust the kernel memory.

Take the number of allocated freeblks into consideration for
softdep_slowdown().

Reported by: pluknet gmail com
Diagnosed and tested by: pho
Approved by: re (rwatson)
MFC after: 1 month


# 195294 02-Jul-2009 kib

In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by: pho
Reviewed by: jeff
Approved by: re (kensmith)
MFC after: 2 weeks


# 195186 30-Jun-2009 kib

Softdep_fsync() may need to lock parent directory of the synced vnode.
Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent
mountpoint from being unmounted and freed while no vnodes are locked.

Tested by: pho
Approved by: re (kensmith)
MFC after: 1 month


# 193307 02-Jun-2009 attilio

Handle lock recursion differenty by always checking against LO_RECURSABLE
instead the lock own flag itself.

Tested by: pho


# 190690 04-Apr-2009 kib

When removing or renaming snaphost, do not delve into request_cleanup().
The later may need blocks from the underlying device that belongs
to normal files, that should not be locked while snap lock is held.

Reported and tested by: pho
MFC after: 1 month


# 184554 02-Nov-2008 attilio

Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho


# 184214 23-Oct-2008 des

Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.


# 184205 23-Oct-2008 des

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# 183072 16-Sep-2008 kib

Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.

Reviewed by: tegge
Tested and used by: pho
MFC after: 1 month


# 183067 16-Sep-2008 kib

The struct inode *ip supplied to softdep_freefile is not neccessary the
inode having number ino. In r170991, the ip was marked IN_MODIFIED, that
is not quite correct.

Mark only the right inode modified by checking inode number.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# 182542 31-Aug-2008 attilio

Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>


# 182365 28-Aug-2008 kib

Softdep code may need to instantiate vnode when processing
dependencies. In particular, it may need this while syncing filesystem
being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set,
that could sometimes disallow insertion of the vnode into the vnode
mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag.

Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for
new vnode and use it consistently from the softdep code instead of
ffs_vget().

Add the retry logic to the softdep_flushfiles() to flush the vnodes
that could be instantiated while flushing softdep dependencies.

Tested by: pho, kris
Reviewed by: tegge
MFC after: 1 month


# 177957 06-Apr-2008 attilio

Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
alredy do and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the correspective per-thread
shared lockmgrs counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw(). These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock. In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS atm, but it will be probabilly
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wakeup, are in the runqueue they can contend the lock with
the exclusive drainer. This is hard to be fixed but the now committed
code mitigate this issue a lot better than the (past) CVS version.
In addition assertive KA_HELD and KA_UNHELD have been made mute
assertions because they are dangerous and they will be nomore supported
soon.

In order to avoid namespace pollution, stack.h is splitted into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI. In this way, newly added _lockmgr.h can
just include _stack.h.

Kernel ABI results heavilly changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.

Tested by: kris, pho, jeff, danger
Reviewed by: jeff
Sponsored by: Google, Summer of Code program 2007


# 177528 23-Mar-2008 kib

Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed: on stable@, with bde
Reviewed by: tegge, jeff [1]
MFC after: 2 weeks


# 177493 22-Mar-2008 jeff

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 177156 13-Mar-2008 cokane

Replace the non-MPSAFE timeout(9) API in ffs_softdep.c with the MPSAFE
callout_* API (e.g. callout_init_mtx(9)). This was one of the numerous
items on the http://wiki.freebsd.org/SMPTODO list.

Reviewed by: imp, obrien, jhb
MFC after: 1 week


# 177034 10-Mar-2008 emaste

Remove include of opt_quota.h; as of revision 1.205 there is no longer
any #ifdef QUOTA conditional code.


# 176519 24-Feb-2008 attilio

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175202 10-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 173501 09-Nov-2007 ru

Fix build without INVARIANTS and update a comment to match
a change made in previous revision.


# 173464 08-Nov-2007 obrien

Turn most ffs 'DIAGNOSTIC's into INVARIANTS.


# 172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


# 170991 22-Jun-2007 kib

Fix livelock that could occur when snapshoting UFS with quotas, where
some quota limit was exceeded. Sequence of UFS_VALLOC()/UFS_VFREE()
call there could cause inodeblock to have both freefile and inodedep
dependencies without any inode in the block being marked for write.
Then, softdep_check_suspend() would return EAGAIN forewer.

Force write of inodeblock with allocated freefile softdependency by
setting IN_MODIFIED flag in softdep_freefile and unconditionally calling
UFS_UPDATE() in ufs_reclaim.

Reported by: kris
Debug help and tested by: Peter Holm
Approved by: re (kensmith)
MFC after: 3 weeks


# 169239 03-May-2007 thompsa

Add a newline to the printf message.


# 168575 10-Apr-2007 kib

Recalculate the NEWBLOCK flag for pagedep structure after the softdep
lock is dropped, since pagedep may be already processed and deallocated.

Found and tested by: kris
MFC after: 2 weeks


# 168574 10-Apr-2007 kib

When LK_NOWAIT is passed as argument to process_worklist_item(), this
does not prevent handle_workitem_remove() from recursing into a blocking
version. Add the dirrem to worklist instead of processing it now if this
is the case.

Reported and tested by: kris
Submitted by: tegge
MFC after: 2 weeks


# 168353 04-Apr-2007 delphij

Use *_EMPTY macros when appropriate.


# 168021 29-Mar-2007 kib

Revert rev. 1.205. Replace unconditional acquision of Giant when QUOTAS are
defined with VFS_LOCK_GIANT(NULL) call.
This shall fix softdep operation when mpsafe_vfs = 0.

Reported and tested by: kris
Submitted by: tegge
MFC after: 1 week


# 167737 20-Mar-2007 kib

Mark UFS as being MP-Safe in "options QUOTA" case too. Remove no more
neccessary Giant acquisions in softdepend processing code.

Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)


# 166924 23-Feb-2007 brian

Account for di_blocks allocations when IN_SPACECOUNTED is set in an
inode's i_flag.

It's possible that after ufs_infactive() calls softdep_releasefile(),
i_nlink stays >0 for a considerable amount of time (> 60 seconds here).
During this period, any ffs allocation routines that alter di_blocks
must also account for the blocks in the filesystem's fs_pendingblocks
value.

This change fixes an eventual df/du discrepency that will happen as
the result of fs_pendingblocks being reduced to <0.

The only manifestation of this that people may recognise is the
following message on boot:

/somefs: update error: blocks -N files M

at which point the negative pending block count is adjusted to zero.

Reviewed by: tegge
MFC after: 3 weeks


# 163874 01-Nov-2006 kib

Aquire Giant in the softdep_flush for clear_remove() and clear_inodedeps()
processing when QUOTA is set.

Reported and tested by: Peter Holm
Reviewed by: tegge
MFC after: 3 days


# 162653 26-Sep-2006 tegge

Reduce fluctuations of mnt_flag to allow unlocked readers to get a
slightly more consistent view.


# 162650 26-Sep-2006 tegge

Increase mnt_noasync once in softdep_mount() to disallow async io,
closing a window where a file system using softupdates could be async
for a short while if both MNT_UPDATE and MNT_ASYNC were passed as flags
to nmount(). Add MNTK_SOFTDEP flag to ensure that softdep_mount()
doesn't increase mnt_noasync multiple times.


# 162647 26-Sep-2006 tegge

Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().


# 162460 20-Sep-2006 kib

Fix the glitch introduced in rev. 1.93. In softdep_sync_metadata(),
switch by worklist type contains two for() loops, for D_INDIRDEP and
D_PAGEDEP. On error, these loops are exited by break, where the switch
actually shall be leaved. Use goto instead of break to reach the error
handling code.

Reported by: Peter Holm
Reviewed by: tegge
Approved by: pjd (mentor)
MFC after: 2 weeks


# 158659 16-May-2006 trhodes

Provide a less cryptic panic message in place of just "found inode."


# 158338 06-May-2006 tegge

ffs_syncvnode() might skip some of the blocks due to them being locked,
assuming them to be inflight write buffers. This is not always the case.
bufdaemon might hold the buffer lock and give up writing the buffer due to it
having dependencies, the file system being suspended or the vnode lock being
held by another thread. When bufdaemon decides to write the buffer there is
still a window before bufobj_wref() has been called, allowing other threads to
believe that the vnode has no dirty buffers or inflight writes.

Try harder to flush first block of new subdirectory to get rid of MKDIR_BODY
dependency.


# 157805 17-Apr-2006 kensmith

Fix panic() message to give the right function name.


# 157447 03-Apr-2006 tegge

Eliminate softdep_flush() livelock by accounting for number of worklist items
marked as being in progress.


# 156587 12-Mar-2006 jeff

- Remove the call to softdep_waitidle after suspending the filesystem.
This does not do what I wanted as all dirty buffers must be flushed
by the call to ffs_sync and any remaining dependency work would mean
that this failed.

Pointed out by: tegge


# 156451 08-Mar-2006 tegge

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


# 156206 02-Mar-2006 jeff

- Acquire lk in softdep_slowdown so that it's owned when we call
softdep_speedup().
- Assert that lk is held in softdep_speedup() rather than acquiring it.
This avoids a potential lock recursion.


# 156203 02-Mar-2006 jeff

- Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris


# 154150 09-Jan-2006 tegge

If the lock passed to getdirtybuf() is the softdep lock then the background
write completed wakeup could be missed. Close the race by grabbing the lock
normally used for protection of bp->b_xflags.

Reviewed by: truckman


# 154149 09-Jan-2006 tegge

Broaden scope of softdep_worklist_busy rwlock protection of softdep processing
to avoid some dependencies being missed by softdep_flushworklist().

Reviewed by: truckman


# 153689 23-Dec-2005 delphij

Typo.


# 150733 29-Sep-2005 truckman

After a rmdir()ed directory has been truncated, force an update of
the directory's inode after queuing the dirrem that will decrement
the parent directory's link count. This will force the update of
the parent directory's actual link to actually be scheduled. Without
this change the parent directory's actual link count would not be
updated until ufs_inactive() cleared the inode of the newly removed
directory, which might be deferred indefinitely. ufs_inactive()
will not be called as long as any process holds a reference to the
removed directory, and ufs_inactive() will not clear the inode if
the link count is non-zero, which could be the result of an earlier
system crash.

If a background fsck is run before the update of the parent directory's
actual link count has been performed, or at least scheduled by
putting the dirrem on the leaf directory's inodedep id_bufwait list,
fsck will corrupt the file system by decrementing the parent
directory's effective link count, which was previously correct
because it already took the removal of the leaf directory into
account, and setting the actual link count to the same value as the
effective link count after the dangling, removed, leaf directory
has been removed. This happens because fsck acts based on the
actual link count, which will be too high when fsck creates the
file system snapshot that it references.

This change has the fortunate side effect of more quickly cleaning
up the large number dirrem structures that linger for an extended
time after the removal of a large directory tree. It also fixes a
potential problem with the shutdown of the syncer thread timing out
if the system is rebooted immediately after removing a large directory
tree.

Submitted by: tegge
MFC after: 3 days


# 149808 05-Sep-2005 tegge

Retain generation count when writing zeroes instead of an inode to disk.

Don't free a struct inodedep if another process is allocating saved inode
memory for the same struct inodedep in initiate_write_inodeblock_ufs[12]().

Handle disappearing dependencies in softdep_disk_io_initiation().

Reviewed by: mckusick


# 149354 21-Aug-2005 tegge

Don't set the COMPLETE flag in an inodedep structure before the related
inode has been written.


# 148608 31-Jul-2005 ups

Delay freeing disk space for file system blocks until all dirty buffers
are safely released. This fixes softdep problems on truncation (deletion)
of files with dirty buffers.

Reviewed by: jeff@, mckusick@, ps@, tegge@
Tested by: glebius@, ps@
MFC after: 3 weeks


# 145824 03-May-2005 jeff

- Don't restrict the softdep stats to DEBUG kernels, they cost nothing to
export. This was happening anyway since this file manually sets DEBUG.
- Add a sysctl for the number of items on the worklist.
- Use a more canonical loop restart in softdep_fsync_mountdev, it saves
some code at the expense of a goto and makes me worry less about
modifying a variable that should be private to the TAILQ_FOREACH_SAFE
macro.


# 144585 03-Apr-2005 jeff

- Move the contents of softdep_disk_prewrite into ffs_geom_strategy to fix
two bugs.
- ffs_disk_prewrite was pulling the vp from the buf and checking for
COPYONWRITE, when really it wanted the vp from the bufobj that we're
writing to, which is the devvp. This lead to us skipping the copy on
write to all file data, which significantly broke snapshots for the
last few months.
- When the SOFTUPDATES option was not included in the kernel config we
would also skip the copy on write check, which would effectively disable
snapshots.
- Remove an invalid mp_fixme().

Debugging tips from: mckusick
Reported by: iedowse, others
Discussed with: phk


# 144118 25-Mar-2005 das

When the softupdates worklist gets too long, threads that attempt to
add more work are forced to process two worklist items first.
However, processing an item may generate additional work, causing the
unlucky thread to recursively process the worklist. Add a per-thread
flag to detect this situation and avoid the recursion. This should
fix the stack overflows that could occur while removing large
directory trees.

Tested by: kris
Reviewed by: mckusick


# 143502 13-Mar-2005 jeff

- The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem. Check that rather than VI_XLOCK.

Sponsored by: Isilon Systems, Inc.


# 142263 22-Feb-2005 jeff

- Add VOP locking asserts in several functions that have been implicated in
recent deadlocks.


# 142123 20-Feb-2005 delphij

The recomputation of file system summary at mount time can be a
very slow process, especially for large file systems that is just
recovered from a crash.

Since the summary is already re-sync'ed every 30 second, we will
not lag behind too much after a crash. With this consideration
in mind, it is more reasonable to transfer the responsibility to
background fsck, to reduce the delay after a crash.

Add a new sysctl variable, vfs.ffs.compute_summary_at_mount, to
control this behavior. When set to nonzero, we will get the
"old" behavior, that the summary is computed immediately at mount
time.

Add five new sysctl variables to adjust ndir, nbfree, nifree,
nffree and numclusters respectively. Teach fsck_ffs about these
API, however, intentionally not to check the existence, since
kernels without these sysctls must have recomputed the summary
and hence no adjustments are necessary.

This change has eliminated the usual tens of minutes of delay of
mounting large dirty volumes.

Reviewed by: mckusick
MFC After: 1 week


# 141685 11-Feb-2005 phk

Make non-SOFTUPDATES kernels compile again.

Integrate the stubfile into the main file now that license issues have been
long resolved.


# 141570 09-Feb-2005 phk

style polishing.


# 141541 08-Feb-2005 phk

Use ffs_truncate() directly instead of UFS_TRUNCATE()


# 141539 08-Feb-2005 phk

Background writes are entirely an FFS/Softupdates thing.

Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.

Remove all the background write related stuff from the normal bufwrite.

This drags the softdep_move_dependencies() back into FFS.

Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.


# 141533 08-Feb-2005 phk

Drag another softupdates tentacle back into FFS: Now that FFS's
vop_fsync is separate from the internal use we can do the full job
there.


# 141526 08-Feb-2005 phk

Don't use the UFS_* and VFS_* functions where a direct call is possble.

The UFS_ functions are for UFS to call back into VFS. The VFS functions
are external entry points into the filesystem.


# 141522 08-Feb-2005 phk

For snapshots we need all VOP_LOCKs to be exclusive.

The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().

Also, as long as we have not abandonned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.


# 141150 02-Feb-2005 jeff

- Use a seperate malloc tag for saved inode contents to help in debugging
memory modified after free errors.

Sponsored by: Isilon Systems, Inc.


# 140709 24-Jan-2005 jeff

- Convert the global LK lock to a mutex.
- Expand the scope of lk to cover not only interrupt races, but also
top-half races, which includes many new uses over global top-half
only data.
- Get rid of interlocked_sleep() and use msleep or BUF_LOCK where
appropriate.
- Use the lk mutex in place of the various hand rolled semaphores.
- Stop dropping the lk lock before we panic.
- Fix getdirtybuf() callers so that they reacquire access to whatever
softdep datastructure they were inxpecting in the failure/retry
case. Previously, sleeps in getdirtybuf() could leave us with
pointers to bad memory.
- Update handling of ffs to be compatible with ffs locking changes.

Sponsored By: Isilon Systems, Inc.


# 140048 11-Jan-2005 phk

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 137035 29-Oct-2004 phk

Move UFS from DEVFS backing to GEOM backing.

This eliminates a bunch of vnode overhead (approx 1-2 % speed
improvement) and gives us more control over the access to the storage
device.

Access counts on the underlying device are not correctly tracked and
therefore it is possible to read-only mount the same disk device multiple
times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1

Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches
things in RAM) this is not possible with read-write mounts, and the system
will correctly reject this.

Details:

Add a geom consumer and a bufobj pointer to ufsmount.

Eliminate the vnode argument from softdep_disk_prewrite().
Pick the vnode out of bp->b_vp for now. Eventually we
should find it through bp->b_bufobj->b_private.

In the mountcode, use g_vfs_open() once we have used
VOP_ACCESS() to check permissions.

When upgrading and downgrading between r/o and r/w do the
right thing with GEOM access counts. Remove all the
workarounds for not being able to do this with VOP_OPEN().

If we are the root mount, drop the exclusive access count
until we upgrade to r/w. This allows fsck of the root
filesystem and the MNT_RELOAD to work correctly.

Set bo_private to the GEOM consumer on the device bufobj.

Change the ffs_ops->strategy function to call g_vfs_strategy()

In ufs_strategy() directly call the strategy on the disk
bufobj. Same in rawread.

In ffs_fsync() we will no longer see VCHR device nodes, so
remove code which synced the filesystem mounted on it, in
case we came there. I'm not sure this code made sense in
the first place since we would have taken the specfs route
on such a vnode.

Redo the highly bogus readblock() function in the snapshot
code to something slightly less bogus: Constructing an uio
and using physio was really quite a detour. Instead just
fill in a bio and ship it down.


# 136982 26-Oct-2004 phk

KASSERT that we only get to prewrite() on writes.


# 136969 26-Oct-2004 phk

The island council met and voted buf_prewrite() home.

Give ffs it's own bufobj->bo_ops vector and create a private strategy
routine, (currently misnamed for forwards compatibility), which is
just a copy of the generic bufstrategy routine except we call
softdep_disk_prewrite() directly instead of through the buf_prewrite()
indirection.

Teach UFS about the need for softdep_disk_prewrite() and call the
function directly in FFS.

Remove buf_prewrite() from the default bufstrategy() and from the
global bio_ops method vector.


# 136963 26-Oct-2004 phk

Degeneralize the per cdev copyonwrite callback. The only possible value
is ffs_copyonwrite() and the only place it can be called from is FFS which
would never want to call another filesystems copyonwrite method, should one
exist, so there is no reason why anything generic should know about this.


# 136943 25-Oct-2004 phk

Loose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# 136767 22-Oct-2004 phk

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


# 136751 21-Oct-2004 phk

Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# 133327 08-Aug-2004 phk

use bufdone() not biodone().


# 132775 28-Jul-2004 kan

Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper
inode field based on UFS version. Use DIP ro read values and DIP_SET
to modify them throughout FFS code base.


# 131907 10-Jul-2004 marcel

Update for the KDB debugger framework:
o Make debugging code conditional upon KDB.
o Use kdb_backtrace() instead of backtrace().
o Remove inclusion of opt_ddb.h.


# 127955 06-Apr-2004 jhb

Fix a paste-o from the buf_prewrite() cleanup commit and check for the
MNTK_SUSPEND flag on the correct vnode pointer in softdep_disk_prewrite().

Reviewed by: phk
Tested by: kensmith


# 126858 11-Mar-2004 phk

When I was a kid my work table was one cluttered mess an cleaning it up
were a rather overwhelming task. I soon learned that if you don't know
where you're going to store something, at least try to pile it next to
something slightly related in the hope that a pattern emerges.

Apply the same principle to the ffs/snapshot/softupdates code which have
leaked into specfs: Add yet a buf-quasi-method and call it from the
only two places I can see it can make a difference and implement the
magic in ffs_softdep.c where it belongs.

It's not pretty, but at least it's one less layer violated.


# 126853 11-Mar-2004 phk

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


# 126154 23-Feb-2004 mckusick

In the function clear_inodedeps(), a FREE_LOCK() should be called
AFTER the call to vn_start_write(), not before it. Otherwise, it is
possible to unlock it multiple times if the vn_start_write() fails.

Submitted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>


# 121443 23-Oct-2003 jhb

Move the P_COWINPROGRESS flag from being a per-process p_flag to being a
per-thread td_pflag which doesn't require any locks to read or write as it
is only read or written by curthread on itself.

Glanced at by: mckusick


# 120841 06-Oct-2003 jeff

- My last commit to this file is still not safe, I believe that it may be
due to the recursion in indir_trunc().


# 120839 06-Oct-2003 jeff

- Reinstate 1.142 this was fixed by 1.144.


# 120750 04-Oct-2003 jeff

- The VI assert in getdirtybuf() is only valid if we're not on a VCHR
vnode. VCHR vnodes don't do background writes.

Reported by: kan


# 120732 04-Oct-2003 jeff

- Remove a mp_fixme() and some locks that weren't necessary. I now
understand how this works.


# 119707 03-Sep-2003 jeff

- Several of the callers to getdirtybuf() were erroneously changed to pass
in a list head instead of a pointer to the first element at the time of
the first call. These lists are subject to change, and getdirtybuf()
would refetch from the wrong list in some cases.

Spottedy by: tegge
Pointy hat to: me


# 119604 31-Aug-2003 jeff

- Backout rev 1.142. This caused a deadlock that I do not understand. More
investigation is required.


# 119603 31-Aug-2003 jeff

- Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to
bail out if the buffer is not already present.
- The buffer returned by incore() is not locked and should not be sent to
brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the
desired semantics.


# 119601 31-Aug-2003 jeff

- Don't acquire the vnode interlock in drain_output(). Instead, require the
caller to acquire it. This permits drain_output() to be done atomically
with other operations as well as reducing the number of lock operations.
- Assert that the proper locks are held in drain_output().
- Change getdirtybuf() to accept a mutex as an argument. This mutex is used
to protect the vnode's buf list and the BKGRDWAIT flag. This lock is
dropped when we successfully acquire a buffer and held on return
otherwise. These semantics reduce the number of cumbersome cases in
calling code.
- Pass the mtx from getdirtybuf() into interlocked_sleep() and allow this
mutex to be used as the interlock argument to BUF_LOCK() in the LOCKBUF
case of interlocked_sleep().
- Change the return value of getdirtybuf() to be the resulting locked buffer
or NULL otherwise. This is for callers who pass in a list head that
requires a lock. It is necessary since the lock that protects the list
head must be dropped in getdirtybuf() so that we don't have a lock order
reversal with the buf queues lock in bremfree().
- Adjust all callers of getdirtybuf() to match the new semantics.
- Add a comment in indir_trunc() that points at unlocked access to a buf.
This may also be one of the last instances of incore() in the tree.


# 119521 28-Aug-2003 jeff

- Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode
interlock.
- Don't use the B_LOCKED flag and QUEUE_LOCKED for background write
buffers. Check for the BKGRDINPROG flag before recycling or throwing
away a buffer. We do this instead because it is not safe for us to move
the original buffer to a new queue from the callback on the background
write buffer.
- Remove the B_LOCKED flag and the locked buffer queue. They are no longer
used.
- The vnode interlock is used around checks for BKGRDINPROG where it may
not be strictly necessary. If we hold the buf lock the a back-ground
write will not be started without our knowledge, one may only be
completed while we're not looking. Rather than remove the code, Document
two of the places where this extra locking is done. A pass should be
done to verify and minimize the locking later.


# 112367 18-Mar-2003 phk

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a newsted include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


# 111856 04-Mar-2003 jeff

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviwed by: arch
Not objected to by: mckusick


# 111748 02-Mar-2003 des

More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).


# 111463 25-Feb-2003 jeff

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


# 111420 24-Feb-2003 mckusick

When removing the last item from a non-empty worklist, the worklist
tail pointer must be updated.

Reported by: Kris Kennaway <kris@obsecurity.org>
Sponsored by: DARPA & NAI Labs.


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 108892 07-Jan-2003 mckusick

This patch fixes a problem caused by applications that rapidly and
repeatedly truncate the same file. Each time the file is truncated,
a buffer is grabbed to store the indirect block numbers that need
to be freed. Those blocks cannot be freed until the inode claiming
them is written to disk. Thus, the number of buffers being held by
soft updates explodes and in extreme cases can run the kernel out
of buffers. The problem can be avoided by doing an fsync on the
file every debug.maxindirdep truncates (currently defaulted to 50).
The fsync causes the inode to be written so that the held buffers
can be freed. The check for excessive buffers is checked as part
of the existing hook for excessive dependencies (softdep_slowdown)
in the truncate code.

Reported by: David Schultz <dschultz@uclink.Berkeley.EDU>
Sponsored by: DARPA & NAI Labs.
MFC after: 3 weeks


# 108533 01-Jan-2003 schweikh

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# 107992 17-Dec-2002 phk

Remove unused lockcnt variable.

Approved by: mckusick


# 107096 20-Nov-2002 mckusick

The target for the maximum number of dependencies has been cut
in half because of reports that under heavy load the kernel could
exhaust its memory pool. The limit is now (desiredvnodes * 4)
rather than (desiredvnodes * 8), so it will still scale with
larger systems, just not as quickly.

Sponsored by: DARPA & NAI Labs.


# 107095 20-Nov-2002 mckusick

If an error occurs while writing a buffer, then the data will
not have hit the disk and the dependencies cannot be unrolled.
In this case, the system will mark the buffer as dirty again so
that the write can be retried in the future. When the write
succeeds or the system gives up on the buffer and marks it as
invalid (B_INVAL), the dependencies will be cleared.

Sponsored by: DARPA & NAI Labs.


# 105823 23-Oct-2002 mckusick

We must be careful to avoid recursive copy-on-write faults when
trying to clean up during disk-full senarios.

Sponsored by: DARPA & NAI Labs.


# 105763 23-Oct-2002 mckusick

Missplaced FREE_LOCK causes a panic when hit while taking a snapshot.

Sponsored by: DARPA & NAI Labs.


# 104104 28-Sep-2002 jmallett

When spamming me with a printf(9), under DIAGNOSTIC, at least be nice enough
to include a newline.

MFC after: 4 days
Sponsored by: Bright Path Solutions


# 104094 28-Sep-2002 phk

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 103946 25-Sep-2002 jeff

- Convert locks to use standard macros.
- Lock access to the buflists.
- Document broken locking.
- Use vrefcnt().


# 101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 100382 20-Jul-2002 peter

Fix a warning:
ffs_softdep.c:1630: warning: int format, different type arg (arg 2)


# 100344 19-Jul-2002 mckusick

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by: DARPA & NAI Labs.


# 99220 01-Jul-2002 iedowse

Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by: mckusick


# 99206 01-Jul-2002 iedowse

Add the ffs bits necessary to support unloading of the ufs kernel
module. This adds an ffs_uninit() function that calls ufs_uninit()
and also calls a new softdep_uninitialize() function. Add a stub
for softdep_uninitialize() to cover the non-SOFTUPDATES case.

Reviewed by: mckusick


# 98687 23-Jun-2002 mux

Warning fixes for 64 bits platforms. This eliminates all the
warnings I have had in the FFS code on sparc64.

Reviewed by: mckusick


# 98542 21-Jun-2002 mckusick

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 96821 17-May-2002 phk

Fix ufs_daddr_t/daddr_t type problems.

Sponsored by: DARPA & NAI labs.


# 96755 16-May-2002 trhodes

More s/file system/filesystem/g


# 94723 15-Apr-2002 jeff

Don't peak into the malloc_type structure for limits. The desired vnodes
check should be sufficient. This is required for the pending removal of
malloc_type limits.


# 92728 19-Mar-2002 alfred

Remove __P.


# 92462 17-Mar-2002 mckusick

Add a flags parameter to VFS_VGET to pass through the desired
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems.
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.


# 92363 15-Mar-2002 mckusick

Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.


# 92299 15-Mar-2002 obrien

Quiet a warning on the Alpha.


# 91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 90329 07-Feb-2002 mckusick

Occationally deleted files would hang around for hours or days
without being reclaimed. This bug was introduced in revision 1.95
dealing with filenames placed in newly allocated directory blocks,
thus is not present in 4.X systems. The bug is triggered when a
new entry is made in a directory after the data block containing
the original new entry has been written, but before the inode
that references the data block has been written.

Submitted by: Bill Fenner <fenner@research.att.com>


# 90098 02-Feb-2002 mckusick

When taking a snapshot, we must check for active files that have
been unlinked (e.g., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.


# 89637 22-Jan-2002 mckusick

This patch fixes a long standing complaint with soft updates in
which small and/or nearly full filesystems would fail with `file
system full' messages when trying to replace a number of existing
files (for example during a system installation). When the allocation
routines are about to fail with a file system full condition, they
make a call to softdep_request_cleanup() which attempts to accelerate
the flushing of pending deletion requests in an effort to free up
space. In the face of filesystem I/O requests that exceed the
available disk transfer capacity, the cleanup request could take
an unbounded amount of time. Thus, the softdep_request_cleanup()
routine will only try for tickdelay seconds (default 2 seconds)
before giving up and returning a filesystem full error. Under typical
conditions, the softdep_request_cleanup() routine is able to free
up space in under fifty milliseconds.


# 89295 12-Jan-2002 mckusick

When going to sleep, we must save our SPL so that it does not get
lost if some other process uses the lock while we are sleeping. We
restore it after we have slept. This functionality is provided by
a new routine interlocked_sleep() that wraps the interlocking with
functions that sleep. This function is then used in place of the
old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros.

Submitted by: Debbie Chu <dchu@juniper.net>


# 89270 11-Jan-2002 mckusick

Must call drain_output() before checking the dirty block list
in softdep_sync_metadata(). Otherwise we may miss dependencies
that need to be flushed which will result in a later panic
with the message ``vinvalbuf: dirty bufs''.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
MFC after: 1 week


# 89089 08-Jan-2002 msmith

Initialise the bioops vector hack at runtime rather than at link time. This
avoids the use of common variables.

Reviewed by: mckusick


# 84050 27-Sep-2001 jhb

- Fix some minor whitespace nits.
- Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a
little nit that caused it to be defined as -(sizeof (struct thread) + 1)
instead of -2.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 78191 13-Jun-2001 mckusick

Build on the change in revision 1.98 by Tor.Egge@fast.no.
The symptom being treated in 1.98 was to avoid freeing a
pagedep dependency if there was still a newdirblk dependency
referencing it. That change is correct and no longer prints
a warning message when it occurs. The other part of revision
1.98 was to panic when a newdirblk dependency was encountered
during a file truncation. This fix removes that panic and
replaces it with code to find and delete the newdirblk
dependency so that the truncation can succeed.


# 77743 05-Jun-2001 obrien

There seems to be a problem that the order of disk write operation being
incorrect due to a missing check for some dependency. This change
avoids the freelist corruption (but not the temporarily inconsistent
state of the file system).

A message is printed as a reminder of the under lying problem when a
pagedep structure is not freed due to the NEWBLOCK flag being set.

Submitted by: Tor.Egge@fast.no


# 76859 19-May-2001 mckusick

Must ensure that all the entries on the pd_pendinghd list have been
committed to disk before clearing them. More specifically, when
free_newdirblk is called, we know that the inode claims the new
directory block. However, if the associated pagedep is still linked
onto the directory buffer dependency chain, then some of the entries
on the pd_pendinghd list may not be committed to disk yet. In this
case, we will simply note that the inode claims the block and let
the pd_pendinghd list be processed when the pagedep is next written.
If the pagedep is no longer on the buffer dependency chain, then
all the entries on the pd_pending list are committed to disk and
we can free them in free_newdirblk. This corrects a window of
vulnerability introduced in the code added in version 1.95.


# 76825 18-May-2001 mckusick

Must be a bit less aggressive about freeing pagedep structures.

Obtained from: Robert Watson <rwatson@FreeBSD.org> and
Matthew Jacob <mjacob@feral.com>


# 76724 17-May-2001 mckusick

When a new block is allocated to a directory, an fsync of a file
whose name is within that block must ensure not only that the block
containing the file name has been written, but also that the on-disk
directory inode references that block. When a new directory block
is created, we allocate a newdirblk structure which is linked to
the associated allocdirect (on its ad_newdirblk list). When the
allocdirect has been satisfied, the newdirblk structure is moved
to the inodedep id_bufwait list of its directory to await the inode
being written. When the inode is written, the directory entries
are fully committed and can be deleted from their pagedep->id_pendinghd
and inodedep->id_pendinghd lists.


# 76357 08-May-2001 mckusick

When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.


# 76354 08-May-2001 mckusick

When syncing out snapshot metadata, we must temporarily allow recursive
buffer locking so as to avoid locking against ourselves if we need to
write filesystem metadata.


# 76173 01-May-2001 phk

Remove blatantly pointless call to VOP_BMAP().

Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.


# 76117 29-Apr-2001 grog

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# 75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


# 74548 21-Mar-2001 mckusick

Add kernel support for running fsck on active filesystems.


# 73287 01-Mar-2001 mckusick

Free lock before returning from process_worklist_item.

Obtained from: Constantine Sapuntzakis <csapuntz@stanford.edu>


# 72941 23-Feb-2001 mckusick

Free lock before calling panic so that subsequent attempt to write out
buffers does not re-panic with `locking against myself'. This change
should not affect normal operations of soft updates in any way.


# 72872 22-Feb-2001 mckusick

When cleaning up excess inode dependencies, check for being done.

Reviewed by: Jan Koum <jkb@yahoo-inc.com>


# 72765 20-Feb-2001 mckusick

This patch corrects two problems with the rate limiting code
that was introduced in revision 1.80. The problem manifested
itself with a `locking against myself' panic and could also
result in soft updates inconsistences associated with inodedeps.
The two problems are:

1) One of the background operations could manipulate the bitmap
while holding it locked with intent to create. This held lock
results in a `locking against myself' panic, when the background
processing that we have been coopted to do tries to lock the bitmap
which we are already holding locked. To understand how to fix this
problem, first, observe that we can do the background cleanups in
inodedep_lookup only when allocating inodedeps (DEPALLOC is set in
the call to inodedep_lookup). Second observe that calls to
inodedep_lookup with DEPALLOC set can only happen from the following
calls into the softdep code:

softdep_setup_inomapdep
softdep_setup_allocdirect
softdep_setup_remove
softdep_setup_freeblocks
softdep_setup_directory_change
softdep_setup_directory_add
softdep_change_linkcnt

Only the first two of these can come from ffs_alloc.c while holding
a bitmap locked. Thus, inodedep_lookup must not go off to do
request_cleanups when being called from these functions. This change
adds a flag, NODELAY, that can be passed to inodedep_lookup to let
it know that it should not do background processing in those cases.

2) The return value from request_cleanup when helping out with the
cleanup was 0 instead of 1. This meant that despite the fact that
we may have slept while doing the cleanups, the code did not recheck
for the appearance of an inodedep (e.g., goto top in inodedep_lookup).
This lead to the softdep inconsistency in which we ended up with
two inodedep's for the same inode.

Reviewed by: Peter Wemm <peter@yahoo-inc.com>,
Matt Dillon <dillon@earth.backplane.com>


# 72012 04-Feb-2001 phk

Another round of the <sys/queue.h> FOREACH transmogriffer.

Created with: sed(1)
Reviewed by: md5(1)


# 71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 71998 04-Feb-2001 phk

Use <sys/queue.h> macro API.


# 71820 30-Jan-2001 dillon

Fix a race between the syncer and umount. When you umount a softupdates
filesystem softdep_process_worklist() is called in a loop until it indicates
that no dependancies remain, but the determination of that fact depends on
there only being one softdep_process_worklist() instance running. It was
possible for the syncer to also be running softdep_process_worklist()
and the pre-existing checks in the code to prevent this were not sufficient
to prevent the race. This patch solves the problem.

Approved-by: mckusick


# 69967 13-Dec-2000 mckusick

Preventing runaway kernel soft updates memory, take three.
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.

Submitted by: Paul Saab <ps@yahoo-inc.com>


# 69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 69774 08-Dec-2000 phk

Staticize some malloc M_ instances.


# 68933 20-Nov-2000 mckusick

More aggressively rate limit the growth of soft dependency structures
in the face of multiple processes doing massive numbers of filesystem
operations. While this patch will work in nearly all situations, there
are still some perverse workloads that can overwhelm the system.
Detecting and handling these perverse workloads will be the subject
of another patch.

Reviewed by: Paul Saab <ps@yahoo-inc.com>
Obtained from: Ethan Solomita <ethan@geocast.com>


# 68885 18-Nov-2000 dillon

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 68715 14-Nov-2000 mckusick

When deleting a file, the ordering of events imposed by soft updates
is to first write the deleted directory entry to disk, second write
the zero'ed inode to disk, and finally to release the freed blocks
and the inode back to the cylinder-group map. As this ordering
requires two disk writes to occur which are normally spaced about
30 seconds apart (except when memory is under duress), it takes
about a minute from the time that a file is deleted until its inode
and data blocks show up in the cylinder-group map for reallocation.
If a file has had only a brief lifetime (less than 30 seconds from
creation to deletion), neither its inode nor its directory entry
may have been written to disk. If its directory entry has not been
written to disk, then we need not wait for that directory block to
be written as the on-disk directory block does not reference the
inode. Similarly, if the allocated inode has never been written to
disk, we do not have to wait for it to be written back either as
its on-disk representation is still zero'ed out. Thus, in the case
of a short lived file, we can simply release the blocks and inode
to the cylinder-group map immediately. As the inode and its blocks
are released immediately, they are immediately available for other
uses. If they are not released for a minute, then other inodes and
blocks must be allocated for short lived files, cluttering up the
vnode and buffer caches. The previous code was a bit too aggressive
in trying to release the blocks and inode back to the cylinder-group
map resulting in their being made available when in fact the inode
on disk had not yet been zero'ed. This patch takes a more conservative
approach to doing the release which avoids doing the release prematurely.


# 66886 09-Oct-2000 eivind

Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)


# 65595 07-Sep-2000 mckusick

Cannot do MALLOC with M_WAITOK while holding ACQUIRE_LOCK

Obtained from: Ethan Solomita <ethan@geocast.com>


# 65557 07-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 63788 24-Jul-2000 mckusick

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# 62976 11-Jul-2000 mckusick

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# 61926 22-Jun-2000 mckusick

Update to new copyright.


# 61813 18-Jun-2000 mckusick

When running with quotas enabled on a filesystem using soft updates,
the system would panic when a user's inode quota was exceeded (see
PR 18959 for details). This fixes that problem.

PR: 18959
Submitted by: Jason Godsey <jason@unixguy.fidalgo.net>


# 61812 18-Jun-2000 mckusick

Some additional performance improvements. When freeing an inode
check to see if it has been committed to disk. If it has never
been written, it can be freed immediately. For short lived files
this change allows the same inode to be reused repeatedly.
Similarly, when upgrading a fragment to a larger size, if it
has never been claimed by an inode on disk, it too can be freed
immediately making it available for reuse often in the next slowly
growing block of the same file.


# 61729 16-Jun-2000 phk

ARGH! I have too many source trees :-(

Fix prototype errors in last commit.


# 61724 16-Jun-2000 phk

Virtualizes & untangles the bioops operations vector.

Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 59241 15-Apr-2000 rwatson

Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes. This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS. Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default). Userland utilities and man pages will be
committed in the next batch. VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS. This will be fixed.

Reviewed by: adrian, bp, freebsd-fs, other unthanked souls
Obtained from: TrustedBSD Project


# 58934 02-Apr-2000 phk

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# 58349 20-Mar-2000 phk

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 58345 20-Mar-2000 phk

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# 56908 30-Jan-2000 mckusick

When writing out bitmap buffers, need to skip over ones that already
have a write in progress. Otherwise one can get in an infinite loop
trying to get them all flushed.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 56209 18-Jan-2000 mckusick

During fastpath processing for removal of a short-lived inode, the
set of restrictions for cancelling an inode dependency (inodedep)
is somewhat stronger than originally coded. Since this check appears
in two places, we codify it into the function check_inode_unwritten
which we then call from the two sites, one freeing blocks and the
other freeing directory entries.

Submitted by: Steinar Haug via Matthew Dillon


# 56208 18-Jan-2000 mckusick

Need to reorganize the flushing of directory entry (pagedep) dependencies
so that they never try to lock an inode corresponding to ".." as this
can lead to deadlock. We observe that any inode with an updated link count
is always pushed into its buffer at the time of the link count change, so
we do not need to do a VOP_UPDATE, but merely find its buffer and write it.
The only time we need to get the inode itself is from the result of a
mkdir whose name will never be ".." and hence locking such an inode will
never request a lock above us in the filesystem tree. Thanks to Brian
Fundakowski Feldman for providing the test program that tickled soft updates
into hanging in "inode" sleep.

Submitted by: Brian Fundakowski Feldman <green@FreeBSD.org>


# 56150 17-Jan-2000 mckusick

Better bounding on softdep_flushfiles; other minor tweeks to checks.


# 56149 17-Jan-2000 mckusick

Must track multiple uncommitted renames until one ultimately gets
committed to disk or is removed.


# 55947 14-Jan-2000 dillon

Non-operational change, fix compiler warning.

Reviewed by: mckusick


# 55931 13-Jan-2000 mckusick

Confirming Peter's fix (locking 101: release the lock before you go
to sleep). Locking 101, part 2: do not look at buffer contents after
you have been asleep. There is no telling what wonderous changes may
have occurred.


# 55928 13-Jan-2000 peter

Free the global softupdates lock prior to tsleep() in getdirtybuf().
This seems to be responsible for a bunch of panics where the process
sleeps and something else finds softupdates "locked" when it shouldn't
be. This commit is unreviewed, but has been a big help here.
Previously my boxes would panic pretty much on the first fsync() that
wrote something to disk.


# 55886 13-Jan-2000 mckusick

Because cylinder group blocks are now written in background,
it is no longer sufficient to get a lock on a buffer to know
that its write has been completed. We have to first get the
lock on the buffer, then check to see if it is doing a
background write. If it is doing background write, we have
to wait for the background write to finish, then check to see
if that fullfilled our dependency, and if not to start another
write. Luckily the explanation is longer than the fix.


# 55885 13-Jan-2000 mckusick

A panic occurs during an fsync when a dirty block associated with
a vnode has not been written (which would clear certain of its
dependencies). The problems arises because fsync with MNT_NOWAIT
no longer pushes all the dirty blocks associated with a vnode. It
skips those that require rollbacks, since they will just get instantly
dirty again. Such skipped blocks are marked so that they will not be
skipped a second time (otherwise circular dependencies would never
clear). So, we fsync twice to ensure that everything will be written
at least once.


# 55794 11-Jan-2000 mckusick

We cannot proceed to free the blocks of the file until the dependencies
have been cleaned up by deallocte_dependencies(). Once that is done, it
is safe to post the request to free the blocks. A similar change is also
needed for the freefile case.


# 55756 10-Jan-2000 phk

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 55726 10-Jan-2000 mckusick

Missing FREE_LOCK call before handle_workitem_freeblocks.

Submitted by: "Kenneth D. Merry" <ken@kdm.org>


# 55697 10-Jan-2000 mckusick

Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.


# 55694 09-Jan-2000 mckusick

Keep tighter control of removal dependencies by limiting the number
of dirrem structure rather than the collaterally created freeblks
and freefile structures. Limit the rate of buffer dirtying by the
syncer process during periods of intense file removal.


# 55692 09-Jan-2000 mckusick

Reorganize softdep_fsync so that it only does the inode-is-flushed
check before the inode is unlocked while grabbing its parent directory.
Once it is unlocked, other operations may slip in that could make
the inode-is-flushed check fail. Allowing other writes to the inode
before returning from fsync does not break the semantics of fsync
since we have flushed everything that was dirty at the time of the
fsync call.


# 55690 09-Jan-2000 mckusick

Make static non-exported functions from soft updates.


# 54700 16-Dec-1999 mckusick

The function request_cleanup() had a tsleep() with PCATCH. It is
quite dangerous, since the process may hold locks at the point,
and if it is stopped in that tsleep the machine may hang. Because
the sleep is so short, the PCATCH is not required here, so it has
been removed. For the future, the FreeBSD team needs to decide
whether it is still reasonable to stop a process in tsleep, as that
may affect any other code that uses PCATCH while holding kernel locks.

Submitted by: Dmitrij Tejblum <tejblum@arc.hq.cti.ru>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 54444 11-Dec-1999 eivind

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


# 53577 22-Nov-1999 phk

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 53452 20-Nov-1999 phk

struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly mp != (void*)&mountlist comparisons.

Requested by: phk
Submitted by: Jake Burkholder jake@checker.org
PR: 14967


# 50480 28-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49535 08-Aug-1999 phk

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 48334 29-Jun-1999 mckusick

No longer need to set B_ASYNC flag since BUF_KERNPROC now
unconditionally sets the identity of the buffer.


# 48276 27-Jun-1999 peter

Keep the inlines for <sys/buf.h> happy..


# 48225 26-Jun-1999 mckusick

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# 47964 16-Jun-1999 mckusick

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


# 47940 15-Jun-1999 mckusick

Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.


# 47381 22-May-1999 julian

Cosmetic changes to make it compile without errors in gcc -Wall


# 47131 14-May-1999 mckusick

Add a hook to ffs_fsync to allow soft updates to get first chance at doing
a sync on the block device for the filesystem. That allows it to push the
bitmap blocks before the inode blocks which greatly reduces the number of
inode rollbacks that need to be done.


# 46827 09-May-1999 mckusick

Put back changes that might be causing trouble on Alpha.


# 46616 07-May-1999 mckusick

Get rid of random debugging cruft; sync up with latest version.


# 46609 07-May-1999 mckusick

Severe slowdowns have been reported when creating or removing many
files at once on a filesystem running soft updates. The root of
the problem is that soft updates limits the amount of memory that
may be allocated to dependency structures so as to avoid hogging
kernel memory. The original algorithm just waited for the disk I/O
to catch up and reduce the number of dependencies. This new code
takes a much more aggressive approach. Basically there are two
resources that routinely hit the limit. Inode dependencies during
periods with a high file creation rate and file and block removal
dependencies during periods with a high file removal rate. I have
attacked these problems from two fronts. When the inode dependency
limits are reached, I pick a random inode dependency, UFS_UPDATE
it together with all the other dirty inodes contained within its
disk block and then write that disk block. This trick usually
clears 5-50 inode dependencies in a single disk I/O. For block and
file removal dependencies, I pick a random directory page that has
at least one remove pending and VOP_FSYNC its directory. That
releases all its removal dependencies to the work queue. To further
hasten things along, I also immediately start the work queue process
rather than waiting for its next one second scheduled run.


# 44398 02-Mar-1999 mckusick

Reorganize locking to avoid holding the lock during calls to bdwrite
and brelse (which may sleep in some systems).

Obtained from: Matthew Dillon <dillon@apollo.backplane.com>


# 44383 02-Mar-1999 mckusick

Ensure that softdep_sync_metadata can handle bmsafemap and mkdir entries
if they ever arise (which should not happen as softdep_sync_metadata is
currently used).


# 44102 17-Feb-1999 mckusick

fix double LIST_REMOVE; other cosmetic changes to match version 9.32.
Obtained from: Jeffrey Hsu <hsu@FreeBSD.ORG>


# 43044 22-Jan-1999 dg

Gutted softdep_deallocate_dependencies and replaced it with a panic. It
turns out to not be useful to unwind the dependencies and continue in
the face of a fatal error.
Also changed the log() to a printf() in softdep_error() so that it will
be output in the case of a impending panic.
Submitted by: Kirk McKusick <mckusick@mckusick.com>


# 42374 07-Jan-1999 bde

Don't pass unused unused timestamp args to UFS_UPDATE() or waste
time initializing them. This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().


# 42354 06-Jan-1999 bde

UFS_UPDATE() takes a boolean `waitfor' arg, so don't pass it the value
MNT_WAIT when we mean boolean `true' or check for that value not being
passed. There was no problem in practice because MNT_WAIT had the
magic value of 1.


# 41659 10-Dec-1998 julian

Remove some compiler warnings.


# 40791 31-Oct-1998 peter

Change dirty block list handling to use TAILQ macros.


# 40692 28-Oct-1998 jkh

Clarify a rather ambiguous debugging message.


# 39933 03-Oct-1998 nate

Fix 'noatime' bug that was unrelated to use of noatime.

The problem is caused when a directory block is compacted. When this
occurs, softdep_change_directoryentry_offset() is called to relocate each
directory entry and adjust its matching diradd structure, if any, to match
the new location of the entry. The bug is that while
softdep_change_directoryentry_offset() correctly adjusts the offsets of
the diradd structures on the pd_diraddhd[] lists (which are not yet ready
to be committed to disk), it fails to adjust the offsets of the diradd
structures on the pd_pendinghd list (which are ready to be committed to
disk). This causes the dependency structures to be inconsistent with
the buf contents. Now, if the compaction has moved a directory entry to
the same offset as one of the diradd structures on the pd_pendinghd list
*and* a syscall is done that tries to remove this directory entry before
this directory block has been written to disk (which would empty
pd_pendinghd), a sanity check in newdirrem() will call panic() when it
notices that the inode number in the entry that it is to be removed doesn't
match the inode number in the diradd structure with that offset of that
entry.

Reviewed by: Kirk McKusick <mckusick@McKusick.COM>
Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>


# 39623 24-Sep-1998 luoqi

Eliminate a race in VOP_FSYNC() when softupdates is enabled.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Two minor changes are also included,
1. Remove gratuitious checks for error return from vn_lock with LK_RETRY set,
vn_lock should always succeed in these cases.
2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount
a little more unstable. It also keeps us in sync with other BSDs.
Suggested by: Bruce Evans <bde@zeta.org.au>


# 38291 12-Aug-1998 julian

Handle the case of moving a directory onto the top of a sibling's
child of the same name.

Submitted by: Kirk Mckusick with fixes from luoqi Chen
Obtained from: Whistle test tree.


# 36936 12-Jun-1998 julian

Note which version of Kirk's sources this corresponds to.


# 36935 12-Jun-1998 julian

Fix the case when renaming to a file that you've just created and deleted,
that had an inode that has not yet been written to disk, when the inode of the
new file is also not yet written to disk, and your old directory entry is not
yet on disk but you need to remove it and the new name exists in memory
but has been deleted but the transaction to write the deleted name to disk
exists and has not yet been cancelled by the request to delete the non
existant name. I don't know how kirk could have missed such a glaring
problem for so long. :-) Especially since the inconsitency survived on
the disk for a whole 4 second on average before being fixed by other code.
This was not a crashing bug but just led to filesystem inconsitencies
if you crashed.

Submitted by: Kirk McKusick (mckusick@mckusick.com)


# 36900 11-Jun-1998 julian

Add B_NOCACHE to several cases where BSD4.4 only required a B_INVAL.
Change worked out by john and kirk in consort.


# 36871 10-Jun-1998 julian

Fix for "live inode" panic.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Reviewed by: yeah right...


# 36866 10-Jun-1998 julian

Remove buggy debugging code.


# 36404 27-May-1998 julian

A fix to a debug test from Kirk.


# 36234 19-May-1998 julian

Bring up-to-date with Whistle's current version
Includes some debugging code.


# 36232 19-May-1998 julian

Merge with Kirk's version as of Feb 20

His version 9.23 == our version 1.5 of ffs_softdep.c
His version 9.5 == our version 1.4 of softdep.c


# 36225 19-May-1998 julian

Merge in Kirk's changes to stop softupdates from hogging all of memory.


# 36212 19-May-1998 julian

Change to stop a silly panic. This should be understood better.
Change a buffer swizzle trick to a bcopy. It would be nice if the efficient
trick could be used in the future.


# 36210 19-May-1998 julian

First published FreeBSD version of soft updates Feb 5.


# 36207 19-May-1998 julian

This commit was generated by cvs2svn to compensate for changes in r36206,
which included commits to RCS files with non-trunk default branches.


# 36206 19-May-1998 julian

Import the next version received from kirk after some
FreeBSD feedback.


# 36201 19-May-1998 julian

Import the earliest version of the soft update code that I have.