History log of /freebsd-current/sys/kern/vfs_bio.c
Revision Date Author Comments
# 780666c0 05-Jun-2024 Ryan Libby <rlibby@FreeBSD.org>

getblk: reduce time under bufobj lock

Use the new pctrie combined insert/lookup facility to reduce work and
time under the bufobj interlock when associating a buf with a vnode.

We now do one lookup in the dirty tree and one combined lookup/insert in
the clean tree instead of one lookup in dirty, two in clean, and then an
insert in clean. We also avoid touching the possibly unrelated buf at
the tail of the queue.

Also correct an issue where the actual order of the tail queue depended
on the insertion order due to sign issues.

Reviewed by: kib (previous version), dougm, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45395


# 3ca6bf79 03-Jun-2024 Ryan Libby <rlibby@FreeBSD.org>

db_show_buffer: minor cleanup

Do some light cleanup to make the output format more consistent for
readability.

Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45442


# a332ba32 21-May-2024 Ryan Libby <rlibby@FreeBSD.org>

getblk: fail faster with GB_LOCK_NOWAIT

If we asked not to wait on a lock, and then we failed to get a buf lock
because we would have had to wait, then just return the error. This
avoids taking the bufobj lock and a second trip to lockmgr.

Reviewed by: mckusick, kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45245


# 7e4ac11b 01-Mar-2024 Konstantin Belousov <kib@FreeBSD.org>

getblkx(9): be more tolerant but also strict with the buffer size checks

It is possible that on-disk filesystem format causes allocation of
buffers of size larger than maxbcachebuf. Currently, getblkx() and
indirectly bufkva_alloc() panic in that situation.

It is more useful to return an error instead, allowing the system to
continue running.

PR: 277414
Reported by: Robert Morris <rtm@lcs.mit.edu>
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# e1e84737 28-Nov-2023 Mateusz Guzik <mjg@FreeBSD.org>

Add DEBUG_POISON_POINTER

If you have a pointer which you know points to stale data, you can
fill it with junk so that dereference later will trap

Reviewed by: kib
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D40946


# fdafd315 24-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Automated cleanup of cdefs and other formatting

Apply the following automated changes to try to eliminate
no-longer-needed sys/cdefs.h includes as well as now-empty
blank lines in a row.

Remove /^#if.*\n#endif.*\n#include\s+<sys/cdefs.h>.*\n/
Remove /\n+#include\s+<sys/cdefs.h>.*\n+#if.*\n#endif.*\n+/
Remove /\n+#if.*\n#endif.*\n+/
Remove /^#if.*\n#endif.*\n/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/types.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/param.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/capsicum.h>/

Sponsored by: Netflix


# 31b94065 09-Oct-2023 Zhenlei Huang <zlei@FreeBSD.org>

buf: Add sysctl flag CTLFLAG_TUN to loader tunable

The sysctl variable 'vfs.unmapped_buf_allowed' is actually a loader
tunable. Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will
report it correctly.

No functional change intended.

Reviewed by: kib, imp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D42113


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# c9b19803 14-Jul-2023 John Baldwin <jhb@FreeBSD.org>

memdesc: Retire MEMDESC_BIO.

Instead, change memdesc_bio to examine the bio and return a memdesc of
a more generic type describing the data buffer.

Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D41029


# 8f056492 13-Jul-2023 Doug Moore <dougm@FreeBSD.org>

vfs_bio: initialize pctries

bufobj_init depends on fields bo_dirty.bv_root and bo_clean.bv_root
being zeroed on entry and pctrie_init zeroing whatever is passed to
them, and so does not call pctrie_init for either of them. That fails
if pctrie_init ever changes to do something other that zeroing data,
so add explicit calls to them.

Reviewed by: cem
Differential Revision: https://reviews.freebsd.org/D40978


# 45cc8519 29-May-2023 Colin Percival <cperciva@FreeBSD.org>

tslog: Annotate parts of SYSINIT cpu

Booting an amd64 kernel on Firecracker with 1 CPU and 128 MB of RAM,
SYSINIT cpu takes roughly 2770 us:
* 2280 us in vm_ksubmap_init
* 535 us in kmem_malloc
* 450 us in pmap_zero_page
* 1720 us in pmap_growkernel
* 1620 us in pmap_zero_page
* 80 us in bufinit
* 480 us in cpu_setregs
* 430 us in cpu_setregs calling load_cr0

Much of this is hypervisor overhead: load_cr0 is slow because it traps
to the hypervisor, and 99% of the time in pmap_zero_page is spent when
we first touch the page, presumably due to the host Linux kernel
faulting in backing pages one by one.

Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D40327


# 4e78addb 30-May-2023 Mark Johnston <markj@FreeBSD.org>

buf: Make the number of pbufs slightly more dynamic

Various subsystems pre-allocate a set of pbufs, allocated to implement
I/O operations. pbuf allocations are transient, unlike most buf
allocations.

Most subsystems preallocate nswbuf or nswbuf/2 pbufs each. The
preallocation ensures that pbuf allocation will succeed in low memory
conditions, which might help avoid deadlocks. Currently we initialize
nswbuf = min(nbuf / 4, 256).

nbuf/4 > 256 on anything but the smallest systems. For example,
nswbuf is 256 in a VM with 128MB of memory. In this configuration, a
firecracker VM with one CPU preallocates over 900 pbufs. This consumes
2MB of RAM and adds several milliseconds to the kernel's (very small)
boot time.

Scale nswbuf by ncpu in the common case. I think this makes more sense
than scaling by the amount of RAM, since pbuf allocations are transient
and aren't used for caching. With the change, we get nswbuf=256 with 8
CPUs. With fewer than 8 CPUs we'll preallocate fewer pbufs than before,
and with more we'll preallocate more.

Event: BSDCan 2023
Reported by: cperciva
Reviewed by: glebius, kib
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D40216


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# e72f7ed4 26-Apr-2023 Mark Johnston <markj@FreeBSD.org>

buf: Dynamically allocate per-CPU buffer queues

To reduce static bloat. No functional change intended.

PR: 269572
Reviewed by: mjg, kib, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39808


# bcd8cd85 01-Mar-2023 Mark Johnston <markj@FreeBSD.org>

buf: Make buf_daemon_shutdown() a no-op after a panic

As in commit 9d7cc536e261a7, there is no need to do anything in this
context.

MFC after: 1 week


# 9d7cc536 23-Feb-2023 Mark Johnston <markj@FreeBSD.org>

buf: Make bufspace_daemon_shutdown() a no-op after a panic

This function doesn't need to do anything in that context, and calling
wakeup() can lead to recursive panics.

Discussed with: mhorne
MFC after: 1 week


# 020e8a4d 11-Feb-2023 Konstantin Belousov <kib@FreeBSD.org>

allocbuf(): convert direct panic() calls to KASSERT()s

Also do minor style adjustments.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38549


# 1029dab6 09-Feb-2023 Mitchell Horne <mhorne@FreeBSD.org>

mi_switch(): clean up switch types and their usage

Overall, this is a non-functional change, except for kernels built with
SCHED_STATS. However, the switch types are useful for communicating the
intent of the caller.

1. Ensure that every caller provides a type. In most cases, we upgrade
the basic yield to sched_relinquish() aka SWT_RELINQUISH.
2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND.
3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO.
4. Remove SWT_NONE altogether and assert that callers always provide
a type flag.
5. Reference the mi_switch(9) man page in the comments, as these flags
will be documented there.

Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38184


# 83286682 08-Nov-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: whack mips remnant

This reverts commit 8ffa01a06199df4d14b56a9261dc2a8b3b156a2f.


# a387bd1b 26-Jul-2022 Dimitry Andric <dim@FreeBSD.org>

Adjust function definition in vfs_bio.c to avoid clang 15 warnings

With clang 15, the following -Werror warning is produced:

sys/kern/vfs_bio.c:3430:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
buf_daemon()
^
void

This is because buf_daemon() is declared with a (void) argument list,
but defined with an empty argument list. Make the definition match the
declaration.

MFC after: 3 days


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# 5bd21cbb 21-Jun-2022 Chuck Silvers <chs@FreeBSD.org>

vfs: fix vfs_bio_clrbuf() for PAGE_SIZE > block size

Calculate the desired page valid mask using math that will not
overflow the types used.

Sponsored by: Netflix

Reviewed by: mckusick, kib, markj
Differential Revision: https://reviews.freebsd.org/D34837


# 1fb00c8f 16-Feb-2022 Konstantin Belousov <kib@FreeBSD.org>

buf_alloc(): Stop using LK_NOWAIT, use LK_NOWITNESS

Despite the buffer taken from cache or free list, it still can be
locked, due to 'lockless lookup' in getblkx() potentially operating on
the freed buffers. The lock is transient, but prevents the use of
LK_NOWAIT there for the goal of neutralizing WITNESS.

Just use LK_NOWITNESS.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 5a8fceb3 21-Feb-2022 Mitchell Horne <mhorne@FreeBSD.org>

boottrace: trace annotations for startup and shutdown

Add trace events for execution of SYSINITs (both static and dynamically
loaded), and to the various steps in the shutdown/panic/reboot paths.

Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #23
Differential Revision: https://reviews.freebsd.org/D30187


# c02780b7 27-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

Add GB_NOWITNESS flag

It prevents WITNESS from recording the lock order for the buffer lock
acquired by getblkx().

Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34073


# 5875b94c 17-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

buf_alloc(): lock the buffer with LK_NOWAIT

The buffer must not be accessed by any other thread, it is freshly
allocated. As such, LK_NOWAIT should be nop but also it prevents
recording the order between the buffer lock and any other locks we might
own in the call to getnewbuf(). In particular, if we own FFS snap lock,
it should avoid triggering false positive warning.

Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072


# 531f8cfe 22-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

Use dedicated lock name for pbufs

Also remove a pointer to array variable, use array address directly.

Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072


# b7ff445f 18-Jan-2022 Alexander Motin <mav@FreeBSD.org>

Reduce bufdaemon/bufspacedaemon shutdown time.

Before this change bufdaemon and bufspacedaemon threads used
kthread_shutdown() to stop activity on system shutdown. The problem is
that kthread_shutdown() has no idea about the wait channel and lock used
by specific thread to wake them up reliably. As result, up to 9 threads
could consume up to 9 seconds to shutdown for no good reason.

This change introduces specific shutdown functions, knowing how to
properly wake up specific threads, reducing wait for those threads on
shutdown/reboot from average 4 seconds to effectively zero.

MFC after: 2 weeks
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D33936


# e76c0108 15-Jan-2022 Alexander Motin <mav@FreeBSD.org>

Fix inverse sleep logic in buf_daemon().

Before commit 3cec5c77d617 buf_daemon() went to longer 1s sleep if
numdirtybuffers <= lodirtybuffers. After that commit new condition
!BIT_EMPTY(BUF_DOMAINS, &bdlodirty) got opposite -- true when one
or more more domains is above lodirtybuffers. As result, on freshly
booted system with no dirty buffers buf_daemon() wakes up 10 times
per second and probably only 1 time per second when there is actual
work to do.

MFC after: 1 week
Reviewed by: kib, markj
Tested by: pho
Differential revision: https://reviews.freebsd.org/D33890


# a5c2d59e 29-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

Expand comment explaining reasons for automatic swapoff on shutdown

Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33167


# b6f4818a 29-Nov-2021 Gordon Bergling <gbe@FreeBSD.org>

vfs: Fix a typo in a sysctl description

- s/dependecies/dependencies/

MFC after: 3 days


# 08bb51f8 27-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

shutdown: unmount filesystems after swapoff

Swap on file requires operational underlying mount, otherwise
swapoff_all() is guaranteed to panic due to the default strategy VOP for
reclaimed vnodes.

Reported and tested by: peterj
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33147


# 8587d752 16-Oct-2021 Wuyang Chung <wy-chung@outlook.com>

Correct the name of the second parameter of biowait to wmesg

This parameter is passed directly to msleep, and the name of the msleep
parameter is wmesg. Make them match.

Pull Request: https://github.com/freebsd/freebsd-src/pull/557


# a7b4a54d 01-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

getblk(): do not require devvp vnodes to be locked

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761


# dfd704b7 23-Oct-2021 Kirk McKusick <mckusick@FreeBSD.org>

Allow biodone() to be used as a completion routine.

An ordered series of BIO_READ and BIO_WRITE operations are
typically done as:

while (work to do) {
setup bp for I/O
g_io_request(bp, consumer);
biowait(bp);
}

Here you need to have biodone() called at the completion of
the I/O to set the BIO_DONE flag and awaken the biowait(). The
obvious way to do this would be to set bio_done = biodone, but
biodone() will only take the desired action if bio_done == NULL.
The relevant code at the end of biodone() is:

done = bp->bio_done;
if (done == NULL) {
mtxp = mtx_pool_find(mtxpool_sleep, bp);
mtx_lock(mtxp);
bp->bio_flags |= BIO_DONE;
wakeup(bp);
mtx_unlock(mtxp);
} else
done(bp);

This code would infinitely recurse if biodone() is specified as the
routine to use at completion. So before this change, a wrapper done
function had to be written:

static void
g_io_done(struct bio *bp)
{

bp->bio_done = NULL;
biodone(bp);
bp->bio_done = g_io_done;
}

This commit changes

if (done == NULL)

to

if (done == NULL || done == biodone)

which eliminates the need for the wrapper function.

Reviewed by: kib
Sponsored by: Netflix


# a4667e09 19-Oct-2021 Mark Johnston <markj@FreeBSD.org>

Convert vm_page_alloc() callers to use vm_page_alloc_noobj().

Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986


# 197a4f29 16-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

buffer pager: allow get_blksize method to return error

Reported and reviewed by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31998


# 89786088 10-Aug-2021 Mark Johnston <markj@FreeBSD.org>

amd64: Populate the KMSAN shadow maps and integrate with the VM

- During boot, allocate PDP pages for the shadow maps. The region above
KERNBASE is currently not shadowed.
- Create a dummy shadow for the vm page array. For now, this array is
not protected by the shadow map to help reduce kernel memory usage.
- Grow shadows when growing the kernel map.
- Increase the default kernel stack size when KMSAN is enabled. As with
KASAN, sanitizer instrumentation appears to create stack frames large
enough that the default value is not sufficient.
- Disable UMA's use of the direct map when KMSAN is configured. KMSAN
cannot validate the direct map.
- Disable unmapped I/O when KMSAN configured.
- Lower the limit on paging buffers when KMSAN is configured. Each
buffer has a static MAXPHYS-sized allocation of KVA, which in turn
eats 2*MAXPHYS of space in the shadow map.

Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31295


# 6faf45b3 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

amd64: Implement a KASAN shadow map

The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map. This region is the shadow map. The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid. Various kernel allocators call kasan_mark() to update
the shadow map.

Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN. UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.

The shadow map uses one byte per eight bytes in the kernel map. In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.

When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map. kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29417


# 369706a6 25-Feb-2021 Mark Johnston <markj@FreeBSD.org>

buf: Fix the dirtybufthresh check

dirtybufthresh is a watermark, slightly below the high watermark for
dirty buffers. When a delayed write is issued, the dirtying thread will
start flushing buffers if the dirtybufthresh watermark is reached. This
helps ensure that the high watermark is not reached, otherwise
performance will degrade as clustering and other optimizations are
disabled (see buf_dirty_count_severe()).

When the buffer cache was partitioned into "domains", the dirtybufthresh
threshold checks were not updated. Fix this.

Reported by: Shrikanth R Kamath <kshrikanth@juniper.net>
Reviewed by: rlibby, mckusick, kib, bdrewery
Sponsored by: Juniper Networks, Inc., Klara, Inc.
Fixes: 3cec5c77d6
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28901


# bf0db193 29-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

buf SU hooks: track buf_start() calls with B_IOSTARTED flag

and only call buf_complete() if previously started. Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().

Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.

Reported by: pho
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundations


# c926114f 27-Jan-2021 Bryan Drewery <bdrewery@FreeBSD.org>

Fix getblk() with GB_NOCREAT returning false-negatives.

It is possible for a buf to be reassigned between the dirty and clean
lists while gbincore_unlocked() looks in each list. Avoid creating
a buffer in that case and fallback to a locked lookup.

This fixes a regression from r363482.

More discussion on potential improvements to the clean and dirty lists
handling is in the review.

Reviewed by: cem, kib, markj, vangyzen, rlibby
Reported by: Suraj.Raju at dell.com
Submitted by: Suraj.Raju at dell.com, cem, [based on both]
MFC after: 2 weeks
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D28375


# cd853791 27-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make MAXPHYS tunable. Bump MAXPHYS to 1M.

Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225


# 44ca4575 21-Oct-2020 Brooks Davis <brooks@FreeBSD.org>

vmapbuf: don't smuggle address or length in buf

Instead, add arguments to vmapbuf. Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by: jhb
Reviewed by: imp, jhb
Obtained from: CheriBSD
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26784


# c2c6fb90 09-Oct-2020 Bryan Drewery <bdrewery@FreeBSD.org>

Use unlocked page lookup for inmem() to avoid object lock contention

Reviewed By: kib, markj
Submitted by: mlaier
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D26653


# 9ceba224 01-Oct-2020 Bryan Drewery <bdrewery@FreeBSD.org>

Revert r366340.

CR wasn't finished and it breaks the build.


# 2398cd11 01-Oct-2020 Bryan Drewery <bdrewery@FreeBSD.org>

Use unlocked page lookup for inmem() to avoid object lock contention

Reviewed By: kib, markj
Sponsored by: Dell EMC Isilon
Submitted by: mlaier
Differential Revision: https://reviews.freebsd.org/D26597


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# 7ad2a82d 18-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error

Most consumers pass NULL.


# 9da903e5 02-Aug-2020 Conrad Meyer <cem@FreeBSD.org>

Unlocked getblk: Fix new false-positive assertion

A free buf's lock may be held (temporarily) due to unlocked lookup, so
buf_alloc() must acquire it without LK_NOWAIT. The unlocked getblk path
should unlock it promptly once it realizes the identity does not match
the buffer it was searching for.

Reported by: gallatin
Reviewed by: kib
Tested by: pho
X-MFC-With: r363482
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D25914


# d6a75d39 30-Jul-2020 Conrad Meyer <cem@FreeBSD.org>

getblk: Remove a non-sensical LK_NOWAIT | LK_SLEEPFAIL

No functional change.

LK_SLEEPFAIL implies a behavior that is only possible if the lock operation can
sleep. LK_NOWAIT prevents the lock operation from sleeping.

Discussed with: kib


# 59d13f61 30-Jul-2020 Conrad Meyer <cem@FreeBSD.org>

getblk: Avoid sleeping on wrong buf in lockless path

If the buffer identity changed during lookup, sleeping could introduce a
lock order reversal. Since we do not know if the identity changed until we
get the lock, we must try-lock (LK_NOWAIT) only. EINTR and ERESTART error
handling becomes irrelevant, as we no longer sleep.

Reported by: kib
Reviewed by: kib
X-MFC-With: r363482
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D25898


# 81dc6c2c 24-Jul-2020 Conrad Meyer <cem@FreeBSD.org>

Use gbincore_unlocked for unprotected incore()

Reviewed by: markj
Sponsored by: Isilon
Differential Revision: https://reviews.freebsd.org/D25790


# 68ee1dda 24-Jul-2020 Conrad Meyer <cem@FreeBSD.org>

Add unlocked/SMR fast path to getblk()

Convert the bufobj tries to an SMR zone/PCTRIE and add a gbincore_unlocked()
API wrapping this functionality. Use it for a fast path in getblkx(),
falling back to locked lookup if we raced a thread changing the buf's
identity.

Reported by: Attilio
Reviewed by: kib, markj
Testing: pho (in progress)
Sponsored by: Isilon
Differential Revision: https://reviews.freebsd.org/D25782


# 422f38d8 10-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix trivial whitespace issues which don't interefere with blame

.. even without the -w switch


# d79ff54b 25-May-2020 Chuck Silvers <chs@FreeBSD.org>

This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.

The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.

There are two cases for disk I/O errors:

- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.

- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.

For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.

This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.

Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.

Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.

The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.

Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.

Coauthered by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088


# bbca4bd7 30-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

buffer pager: skip bogus pages.

We cannot validate bogus page by reading a buffer.

PR: 244713
Reviewed by: glebius, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24038


# 695e0701 05-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

buffer pager: deref ucred immediately after read.

Ucred is passed to bread(9) so that non-local filesystems use proper
credentials. But, since clean buffer might be cached unless
buf_pager_relbuf is not enabled, it makes credentials to have extra
reference until buffer is reclaimed. Ucred reference would prevent
jail from destroying if creds are jailed.

Dereferencing the read credentials on the valid buffer avoid that, and
should be fine because the buffer is valid and does not need re-read.

PR: 238032
Reported by: bz
Reproduced and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D23775


# 6be21eb7 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Provide a lock free alternative to resolve bogus pages. This is not likely
to be much of a perf win, just a nice code simplification.

Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D23866


# 7aaf252c 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Convert a few triviail consumers to the new unlocked grab API.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23847


# c99d0c58 28-Feb-2020 Mark Johnston <markj@FreeBSD.org>

Add a blocking counter KPI.

refcount(9) was recently extended to support waiting on a refcount to
drop to zero, as this was needed for a lockless VM object
paging-in-progress counter. However, this adds overhead to all uses of
refcount(9) and doesn't really match traditional refcounting semantics:
once a counter has dropped to zero, the protected object may be freed at
any point and it is not safe to dereference the counter.

This change removes that extension and instead adds a new set of KPIs,
blockcount_*, for use by VM object PIP and busy.

Reviewed by: jeff, kib, mjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23723


# f1fa1ba3 03-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

Fix up various vnode-related asserts which did not dump the used vnode


# 3ff65f71 30-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Remove duplicated empty lines from kern/*.c

No functional changes.


# 879e0604 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Add KERNEL_PANICKED macro for use in place of direct panicstr tests


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 686bcb5c 15-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

schedlock 4/4

Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.

This change has not been made to 4BSD but in principle it would be
more straightforward.

Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778


# d6fee74a 12-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_sync(9), and make kernel code call it instead of going
via sys_sync(2). Minor cleanup, no functional changes.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19366


# 61322a0a 04-Dec-2019 Alexander Motin <mav@FreeBSD.org>

Mark some more hot global variables with __read_mostly.

MFC after: 1 week


# d00066a5 03-Dec-2019 Kirk McKusick <mckusick@FreeBSD.org>

Currently the breadn_flags() and getblkx() interfaces are passed
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_MAP function
may no longer work, for example when a file is being truncated.

No functional change.

Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix


# 6ee653cf 29-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

Drop the object lock in vfs_bio and cluster where it is now safe to do so.

Recent changes to busy/valid/dirty have enabled page based synchronization
and the object lock is no longer required in many cases.

Reviewed by: kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21597


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# 63e97555 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(1/6) Replace busy checks with acquires where it is trival to do so.

This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548


# e61e783b 04-Oct-2019 Eric van Gyzen <vangyzen@FreeBSD.org>

Add CTLFLAG_STATS to some vfs sysctl OIDs

Add CTLFLAG_STATS to the following OIDs:

vfs.altbufferflushes
vfs.recursiveflushes
vfs.barrierwrites
vfs.flushwithdeps
vfs.reassignbufcalls

Refer to r353111.

MFC after: 2 weeks
Sponsored by: Dell EMC Isilon


# 11b57401 12-Sep-2019 Hans Petter Selasky <hselasky@FreeBSD.org>

Use REFCOUNT_COUNT() to obtain refcount where appropriate.
Refcount waiting will set some flag bits in the refcount value.
Make sure these bits get cleared by using the REFCOUNT_COUNT()
macro to obtain the actual refcount.

Differential Revision: https://reviews.freebsd.org/D21620
Reviewed by: kib@, markj@
MFC after: 1 week
Sponsored by: Mellanox Technologies


# aaa38524 11-Sep-2019 Conrad Meyer <cem@FreeBSD.org>

buf: Add B_INVALONERR flag to discard data

Setting the B_INVALONERR flag before a synchronous write causes the buf
cache to forcibly invalidate contents if the write fails (BIO_ERROR).

This is intended to be used to allow layers above the buffer cache to make
more informed decisions about when discarding dirty buffers without
successful write is acceptable.

As a proof of concept, use in msdosfs to handle failures to mark the on-disk
'dirty' bit during rw mount or ro->rw update.

Extending this to other filesystems is left as future work.

PR: 210316
Reviewed by: kib (with objections)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21539


# 4cdea4a8 10-Sep-2019 Jeff Roberson <jeff@FreeBSD.org>

Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255


# a6935d08 05-Sep-2019 Conrad Meyer <cem@FreeBSD.org>

Remove long-dead BUF_ASSERT_{,UN}HELD assertions

These were fully neutered in r177676 (2008), but not removed at the time for
unclear reasons. They're totally dead code, so go ahead and yank them now.

No functional change.


# 772dd133 28-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Avoid direct accesses of the vm_page wire_count field.

No functional change intended.

Sponsored by: Netflix


# 98549e2d 29-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Centralize the logic in vfs_vmio_unwire() and sendfile_free_page().

Both of these functions atomically unwire a page, optionally attempt
to free the page, and enqueue or requeue the page. Add functions
vm_page_release() and vm_page_release_locked() to perform the same task.
The latter must be called with the page's object lock held.

As a side effect of this refactoring, the buffer cache will no longer
attempt to free mapped pages when completing direct I/O. This is
consistent with the handling of pages by sendfile(SF_NOCACHE).

Reviewed by: alc, kib
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20986


# daec9284 21-May-2019 Conrad Meyer <cem@FreeBSD.org>

Include ktr.h in more compilation units

Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon


# 8e7130a8 29-Apr-2019 Mark Johnston <markj@FreeBSD.org>

Stop checking TD_IDLETHREAD() in buffer cache routines.

These predicates are vestigal and cannot be true today. For example,
idle threads are not allowed to acquire locks.

Also cache curthread in breada().

No functional change intended.

Reviewed by: kib, mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20066


# f841e638 26-Apr-2019 Alan Somers <asomers@FreeBSD.org>

[skip ci] fix typo in comment from r59840

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 3193b25a 12-Mar-2019 Kirk McKusick <mckusick@FreeBSD.org>

This is an additional fix for bug report 230962. When using
extended attributes, the kernel can panic with either "ffs_truncate3"
or with "softdep_deallocate_dependencies: dangling deps".

The problem arises because the flushbuflist() function which is
called to clear out buffers is passed either the V_NORMAL flag to
indicate that it should flush buffer associated with the contents
of the file or the V_ALT flag to indicate that it should flush the
buffers associated with the extended attribute data. The buffers
containing the extended attribute data are identified by having
their BX_ALTDATA flag set in the buffer's b_xflags field. The
BX_ALTDATA flag is set on the buffer when the extended attribute
block is first allocated or when its contents are read in from the
disk.

On a busy system, a buffer may be reused for another purpose, but
the contents of the block that it contained continues to be held
in the main page cache. Each physical page is identified as holding
the contents of a logical block within a specified file (identified
by a vnode). When a request is made to read a file, the kernel first
looks for the block in the existing buffers. If it is not found
there, it checks the page cache to see if it is still there. If
it is found in the page cache, then it is remapped into a new
buffer thus avoiding the need to read it in from the disk.

The bug is that when a buffer request made for an extended attribute
is fulfilled by reconstituting a buffer from the page cache rather
than reading it in from disk, the BX_ALTDATA flag was not being
set. Thus the flushbuflist() function would never clear it out and
the "ffs_truncate3" panic would occur because the vnode being cleared
still had buffers on its clean-buffer list. If the extended attribute
was being updated, it is first read, then updated, and finally
written. If the read is fulfilled by reconstituting the buffer
from the page cache the BX_ALTDATA flag was not set and thus the
dirty buffer would never be flushed by flushbuflist(). Eventually
the buffer would be recycled. Since it was never written it would
have an unfinished dependency which would trigger the
"softdep_deallocate_dependencies: dangling deps" panic.

The fix is to ensure that the BX_ALTDATA flag is set when a buffer
has been reconstituted from the page cache.

PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix


# 93fa5ae7 11-Mar-2019 Kirk McKusick <mckusick@FreeBSD.org>

Augment DDB "show buffer" command to print the buffer's referenced
vnode pointer (b_vp). The value of b_vp can be used by "show vnode"
to print the vnode and "show vnodebufs" to print all the clean and
dirty buffers associated with the vnode (which should include this
buffer).

Sponsored by: Netflix


# dab83bd1 25-Jan-2019 Kirk McKusick <mckusick@FreeBSD.org>

Add printing of b_ioflags to DDB `show buffer' command.

Sponsored by: Netflix


# d1bb5d7d 16-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Fix mistake in r343030: move nswbuf calculation back to
kern_vfs_bio_buffer_alloc(), because in init_param2() nbuf
isn't really initialized yet.

Pointed out by: bde


# 756a5412 14-Jan-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.

o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.

Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.

The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with: gallatin
Tested by: pho


# 200bf727 01-Dec-2018 Konstantin Belousov <kib@FreeBSD.org>

Correct accuracy of the barrier writes accounting.

Discussed with: mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# f71ef9b6 06-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Use plain atomic_{add,subtract} when that's sufficient.

CID: 1386920
MFC after: 2 weeks


# 3fb14f61 01-Jun-2018 Mark Johnston <markj@FreeBSD.org>

Avoid completing I/O when dumping core after a panic.

Filesystem or pager completion callbacks are generally non-functional
after a panic and may trigger deadlocks if invoked in this context
(e.g., by attempting to destroying a buffer mapping). To avoid this
situation, short-circuit I/O completion in biodone().

Reviewed by: imp
Discussed with: mav
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D15592


# 84482abd 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

vfs: annotate variables only used by debug builds as __unused


# 2ebc8829 13-May-2018 Konstantin Belousov <kib@FreeBSD.org>

Detect and optimize reads from the hole on UFS.

- Create getblkx(9) variant of getblk(9) which can return error.
- Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP
was performed before the buffer is created, and EJUSTRETURN returned
in case the requested block does not exist.
- Make ffs_read() use GB_NOSPARSE to avoid instantiating buffer (and
allocating the pages for it), copying from zero_region instead.

The end result is less page allocations and buffer recycling when a
hole is read, which is important for some benchmarks.

Requested and reviewed by: jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D14917


# 1b5c869d 04-May-2018 Mark Johnston <markj@FreeBSD.org>

Fix some races introduced in r332974.

With r332974, when performing a synchronized access of a page's "queue"
field, one must first check whether the page is logically dequeued. If
so, then the page lock does not prevent the page from being removed
from its page queue. Intoduce vm_page_queue(), which returns the page's
logical queue index. In some cases, direct access to the "queue" field
is still required, but such accesses should be confined to sys/vm.

Reported and tested by: pho
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D15280


# 7dfbbc61 22-Apr-2018 Tijl Coosemans <tijl@FreeBSD.org>

Make bufdaemon and bufspacedaemon use kthread_suspend_check instead of
kproc_suspend_check. In r329612 bufspacedaemon was turned into a thread
of the bufdaemon process causing both to call kproc_suspend_check with the
same proc argument and that function contains the following while loop:

while (SIGISMEMBER(p->p_siglist, SIGSTOP)) {
wakeup(&p->p_siglist);
msleep(&p->p_siglist, &p->p_mtx, PPAUSE, "kpsusp", 0);
}

So one thread wakes up the other and the other wakes up the first again,
locking up UP machines on shutdown.

Also register the shutdown handlers with SHUTDOWN_PRI_LAST + 100 so they
run after the syncer has shutdown, because the syncer can cause a
situation where bufdaemon help is needed to proceed.

PR: 227404
Reviewed by: kib
Tested by: cy, rmacklem


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# e8cbe51a 26-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Fix a bug introduced in r329612 that slowly invalidates all clean bufs.

Reported by: bde
Reviewed by: bde
Sponsored by: Netflix, Dell/EMC Isilon


# 27cd06b3 21-Mar-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Redo r331328. We need to fix not only type but also format. While
here again notice that we are fixing regression from r331106.


# 5aab68f2 21-Mar-2018 Gleb Smirnoff <glebius@FreeBSD.org>

Fix sysctl types broken in r329612.


# a7defaea 21-Mar-2018 Mark Johnston <markj@FreeBSD.org>

Elide the object lock in the common case in vfs_vmio_unwire().

The object lock was only needed when attempting to free B_DIRECT
buffer pages, and for testing for invalid pages (and freeing them
if so). Handle the latter by instead moving invalid pages near the head
of the inactive queue, where they will be reclaimed quickly.

Reviewed by: alc, kib, jeff
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14778


# 3e867f24 21-Mar-2018 Warner Losh <imp@FreeBSD.org>

bufshutdown is no longer called with Giant held, so there's no need to
drop or pickup Giant anymore. Remove that code and adjust comments.


# 2acde6a8 19-Mar-2018 Justin Hibbits <jhibbits@FreeBSD.org>

Cast through uintptr_t to narrow the buf domain pointer on 32-bit archs

arg2 is an intmax_t, which on 32-bit architectures is 64 bits, wider than a
pointer. When &bdomain[i] is added to arg2 it widens from uintptr_t to
intmax_t, then gcc whines when it gets cast to a pointer. Casting through
uintptr_t silences this warning.


# 0eb50f9c 18-Mar-2018 Mark Johnston <markj@FreeBSD.org>

Have vm_page_{deactivate,launder}() requeue already-queued pages.

In many cases the page is not enqueued so the change will have no
effect. However, the change is needed to support an optimization in
the fault handler and in some cases (sendfile, the buffer cache) it
was being emulated by the caller anyway.

Reviewed by: alc
Tested by: pho
MFC after: 2 weeks
X-Differential Revision: https://reviews.freebsd.org/D14625


# 3cec5c77 17-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Move the dirty queues inside the per-domain structure. This resolves a bug
where we had not hit global dirty limits but a single queue was starved
for space by dirty buffers. A single buf_daemon is maintained for now.

Add a bd_speedup() when we are low on bufspace. This can happen due to SUJ
keeping many bufs locked until a cg block is written. Document this with
a comment.

Fix sysctls to work with per-domain variables. Add more ddb debugging.

Reported by: pho
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14705


# 330b675f 14-Mar-2018 Conrad Meyer <cem@FreeBSD.org>

vfs_bio.c: Apply cleanups motivated by Coverity analysis

It is believed that the conditions Coverity indicated were actually
impossible to hit. So this patch just adds a cleanup to only compute
v_mount once in brelse(), and in vfs_bio_getpages() always initializes error
to zero to appease the static analyzer.

No functional change intended.

Submitted by: Darrick Lew <darrick.freebsd AT gmail.com>
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14613


# 1c2529ab 24-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Fix issues with sparse cpu allocation. Consistently use mp_maxid + 1.

Reported by: pho
Reviewed by: markj
Sponsored by: Netflix, Dell/EMC Isilon


# a0c722bd 22-Feb-2018 Mateusz Guzik <mjg@FreeBSD.org>

Fix up sysctl vfs.buffercache broken in r329612

Sample problem:
top: sysctl(vfs.bufspace...) expected 8, got 4

Reported by: O. Hartmann <ohartmann walstatt.org>


# 683ca3a4 20-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Fix the broken subqueue assignment for the cleanq.

Reported by: pho
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 06220fa7 19-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Further parallelize the buffer cache.

Provide multiple clean queues partitioned into 'domains'. Each domain manages
its own bufspace and has its own bufspace daemon. Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.

Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.

Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.

Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.

Reviewed by: markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14274


# e958ad4c 12-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Make v_wire_count a per-cpu counter(9) counter. This eliminates a
significant source of cache line contention from vm_page_alloc(). Use
accessors and vm_page_unwire_noq() so that the mechanism can be easily
changed in the future.

Reviewed by: markj
Discussed with: kib, glebius
Tested by: pho (earlier version)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14273


# 13a025d5 09-Feb-2018 Kirk McKusick <mckusick@FreeBSD.org>

Merge biodone_finish() back into biodone(). The primary purpose is
to make the order of operations clearer to avoid the race condition
that was fixed in r328914. In particular, this commit corrects a
similar race that existed in the soft updates callback.

Doing some sleuthing through the SVN repository, it appears that
bufdone_finish() was added to support XFS:

------------------------------------------------------------------------
r153192 | rodrigc | 2005-12-06 19:39:08 -0800 (Tue, 06 Dec 2005) | 13 lines

Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
- 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
- b_pin_count, count of pinned buffer

- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions

Patches provided by: kan
Reviewed by: phk
Silence on: arch@

------------------------------------------------------------------------

It does not appear to ever have been used for anything else. XFS was
disconnected in r241607:

------------------------------------------------------------------------
r241607 | attilio | 2012-10-16 03:04:00 -0700 (Tue, 16 Oct 2012) | 5 lines

Disconnect non-MPSAFE XFS from the build in preparation for dropping
GIANT from VFS.

This is not targeted for MFC.

------------------------------------------------------------------------

and removed entirely in r247631:

------------------------------------------------------------------------
r247631 | attilio | 2013-03-02 07:33:54 -0800 (Sat, 02 Mar 2013) | 5 lines

Garbage collect XFS bits which are now already completely disconnected
from the tree since few months.

This is not targeted for MFC.

------------------------------------------------------------------------

Since XFS support is gone, there is no reason to retain biodone_finish().

Suggested by: Warner Losh (imp)
Discussed with: cem, kib
Tested by: Peter Holm (pho)


# 1d3a1bcf 07-Feb-2018 Mark Johnston <markj@FreeBSD.org>

Dequeue wired pages lazily.

Previously, wiring a page would cause it to be removed from its page
queue. In the common case, unwiring causes it to be enqueued at the tail
of that page queue. This change modifies vm_page_wire() to not dequeue
the page, thus avoiding the highly contended page queue locks. Instead,
vm_page_unwire() takes care of requeuing the page as a single operation,
and the page daemon dequeues wired pages as they are encountered during
a queue scan to avoid needlessly revisiting them later. For pages in
PQ_ACTIVE we do even better, since a requeue is unnecessary.

The change improves scalability for some common workloads. For instance,
threads wiring pages into the buffer cache no longer need to modify
global page queues, and unwiring is usually done by the bufspace thread,
so concurrency is not as much of an issue. As another example, many
sysctl handlers wire the output buffer to avoid faults on copyout, and
since the buffer is likely to be in PQ_ACTIVE, we now entirely avoid
modifying the page queue in this case.

The change also adds a block comment describing some properties of
struct vm_page's reference counters, and the busy lock.

Reviewed by: jeff
Discussed with: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D11943


# 47806d1b 05-Feb-2018 Kirk McKusick <mckusick@FreeBSD.org>

Occasional cylinder-group check-hash errors were being reported on
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.

The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:

- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time which makes the in-memory
check-hash value invalid but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
never mark it dirty, thus do not write it out, and hence never
update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
the old check hash is still in the VM cache, but some other pages
are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
from the VM cache, resulting in the old check hash in the buffer.
- The value computed after doing the read does not match causing the
error to be printed.

The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.

The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.

- When a read completes we had the following sequence:

- bufdone()
-- b_ckhashcalc (calculates check hash)
-- bufdone_finish()
--- vfs_vmio_iodone() (replaces bogus pages with the cached ones)

- When we are reading a buffer where one or more pages are already
in memory (but not all pages, or we wouldn't be doing the read),
the I/O is done with bogus_page mapped in for the pages that exist
in the VM cache. This mapping is done to avoid corrupting the
cached pages if there is any I/O overrun. The vfs_vmio_iodone()
function is responsible for replacing the bogus_page(s) with the
cached ones. But we were calculating the check hash before the
bogus_page(s) were replaced. Hence, when we were calculating the
check hash, we were partly reading from bogus_page, which means
we calculated a bad check hash (e.g., because multiple pages have
been mapped to bogus_page, so its contents are indeterminate).

The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.

With these two changes, the occasional cylinder-group check-hash
errors are gone.

Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner


# 3e4f610d 17-Jan-2018 Andriy Gapon <avg@FreeBSD.org>

correct read-ahead calculations in vfs_bio_getpages

Previously the calculations were done as if the requested region
ended at the start of the last requested page, not its end.
The problem as actually quite minor as it affected only stats and
page prefaulting, not the actual page data, and only with specific
parameters.

Reviewed by: kib (previous version)
MFC after: 2 weeks


# ab3185d1 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement NUMA support in uma(9) and malloc(9). Allocations from specific
domains can be done by the _domain() API variants. UMA also supports a
first-touch policy via the NUMA zone flag.

The slab layer is now segregated by VM domains and is precise. It handles
iteration for round-robin directly. The per-cpu cache layer remains
a mix of domains according to where memory is allocated and freed. Well
behaved clients can achieve perfect locality with no performance penalty.

The direct domain allocation functions have to visit the slab layer and
so require per-zone locks which come at some expense.

Reviewed by: Attilio (a slightly older version)
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# cab229b2 20-Nov-2017 Scott Long <scottl@FreeBSD.org>

Update a comment in brelse() to match reality.


# 8d6fbbb8 07-Nov-2017 Jeff Roberson <jeff@FreeBSD.org>

Replace manyinstances of VM_WAIT with blocking page allocation flags
similar to the kernel memory allocator.

This simplifies NUMA allocation because the domain will be known at wait
time and races between failure and sleeping are eliminated. This also
reduces boilerplate code and simplifies callers.

A wait primitive is supplied for uma zones for similar reasons. This
eliminates some non-specific VM_WAIT calls in favor of more explicit
sleeps that may be satisfied without new pages.

Reviewed by: alc, kib, markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon


# 75e3597a 21-Sep-2017 Kirk McKusick <mckusick@FreeBSD.org>

Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
by individual filesystems for their own purpose. Their specific
definitions are found in the header files for each filesystem
that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
It is only necessary to compute a check hash for a cylinder
group when it is actually read from disk. When calling bread,
you do not know whether the buffer was found in the cache or
read. So a new flag (GB_CKHASH) and a pointer to a function to
perform the hash has been added to breadn_flags to say that the
function should be called to calculate a hash if the data has
been read. The check hash is placed in b_ckhash and the B_CKHASH
flag is set to indicate that a read was done and a check hash
calculated. Though a rather elaborate mechanism, it should
also work for check hashing other metadata in the future. A
kernel internal API change was to change breada into a static
fucntion and add flags and a function pointer to a check-hash
function.

sys/ufs/ffs/fs.h:
Add flags for types of check hashes; stored in a new word in the
superblock. Define corresponding BX_ flags for the different types
of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
In ffs_getcg do the dance with breadn_flags to get a check hash and
if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
Copy across the BX_FFSTYPES flags in background writes.
Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
Include libkern/crc32.c in libufs and use it to compute check
hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
Offer to add check hashes to existing filesystems.
Precompute check hashes when rebuilding cylinder group
(although this will be done when it is written in fsutil.c
it is necessary to do it early before comparing with the old
cylinder group)

sbin/dumpfs/dumpfs.c
Print out the new check hash flag(s)

sbin/fsdb/Makefile:
Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)


# fe933c1d 06-Sep-2017 Mateusz Guzik <mjg@FreeBSD.org>

Start annotating global _padalign locks with __exclusive_cache_line

While these locks are guarnteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceeded them.

For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separate by the lock) to be collapsed into 1.

The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.

MFC after: 1 week


# 9df950b3 11-Aug-2017 Mark Johnston <markj@FreeBSD.org>

Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT.

This will allow its use in sendfile_swapin().

Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11942


# 5471caf6 08-Aug-2017 Alan Cox <alc@FreeBSD.org>

Introduce vm_page_grab_pages(), which is intended to replace loops calling
vm_page_grab() on consecutive page indices. Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().

Reviewed by: kib, markj
Tested by: pho (an earlier version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11926


# 6c7ebc24 31-Jul-2017 Mark Johnston <markj@FreeBSD.org>

Batch v_wire_count decrements in vm_hold_free_pages().

Atomic updates to v_wire_count are a significant source of contention, so
combine multiple updates into one in this easy case. Also remove an old
printf that gets executed if the page is shared-busied, which is a case
that will lead to a panic anyway.

Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11791


# d1c5e240 17-Jun-2017 Rick Macklem <rmacklem@FreeBSD.org>

Make MAXBCACHEBUF a tunable called vfs.maxbcachebuf.

By making MAXBCACHEBUF a tunable, it can be increased to allow for
larger read/write data sizes for the NFS client.
The tunable is limited to MAXPHYS, which is currently 128K.
Making MAXPHYS a tunable or increasing its value is being discussed,
since it would be nice to support a read/write data size of 1Mbyte
for the NFS client when mounting the AmazonEFS file service.

Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D10991


# 04005c2f 23-Apr-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make it possible to terminate "show lockedbufs" by pressing "q".

MFC after: 2 weeks


# 10be9457 23-Apr-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Improve BUF_TRACKING by not displaying NULL entries.

Reviewed by: cem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10443


# 83c9dea1 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156


# b66f26e9 14-Apr-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Don't try to write out bufs that have already failed with ENXIO.
This fixes some panics after disconnecting mounted disks.

Submitted by: imp (slightly different version, which I've then lost)
Reviewed by: kib, imp, mckusick
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9674


# ac46d386 19-Mar-2017 Alan Cox <alc@FreeBSD.org>

Style fixes. In particular, the variable "bogus" is used like a Boolean.
Define it as such.

Reviewed by: kib
MFC after: 1 week


# d1780e8d 14-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 90e17792 10-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Do not set BIO_DONE if the BIO specifies a completion handler.

biowait() will otherwise race with completions of such BIOs. In-tree code
only calls biowait() on BIOs that do not specify a handler, so this change
should not have any functional impact.

Reviewed by: mav
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9070


# bfc8c24c 04-Jan-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Move bogus_page declaration to vm_page.h and initialization to vm_page.c.

Reviewed by: kib


# 99e6e193 23-Nov-2016 Mark Johnston <markj@FreeBSD.org>

Release laundered vnode pages to the head of the inactive queue.

The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589


# eb962424 22-Nov-2016 Konstantin Belousov <kib@FreeBSD.org>

Restore vnode pager statistic for buffer pagers.

Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8585


# 8ffa01a0 14-Nov-2016 Adrian Chadd <adrian@FreeBSD.org>

[mips] enable relbuf on mips for now to work around page aliasing in mips hardware.

Although the higher end MIPS hardware handles cache aliasing issues in
hardware, the older cores (r4k, etc) and some compile versions of the
newer cores (mips24k, mips34k, mips74k) don't have this feature.
This means we end up with some very unfortunate behaviour that was
made very obvious by some recent changes to the FFS pager by kib.

So, flip this off until we get our MIPS pmap/cache code upgraded to
handle aliased pages in software.

Discussed with: kib, bsdimp, juli


# 9a639daf 08-Nov-2016 Konstantin Belousov <kib@FreeBSD.org>

Tweaks for the buffer pager.

Pass current thread credentials instead of NOCRED.
Only allow unmapped buffers for filesystem which proclaimed the support.

For all filesystems which currently use buffer pager (UFS, msdosfs and
cd9660), the changes are effectively nop.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 8532d381 31-Oct-2016 Conrad Meyer <cem@FreeBSD.org>

Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging

Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.

Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8366


# c39baa74 28-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Generalize UFS buffer pager to allow it serving other filesystems
which also use buffer cache.

Most important addition to the code is the handling of filesystems
where the block size is less than the machine page size, which might
require reading several buffers to validate single page.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 5975e53d 13-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix a race in vm_page_busy_sleep(9).

Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
VM_OBJECT_WLOCK(object);
...
if (vm_page_xbusied(m)) {
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object); <---1
vm_page_busy_sleep(p, "vmopax");
goto again;
}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs vm_object lock
to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196


# f43292ec 05-Oct-2016 Conrad Meyer <cem@FreeBSD.org>

vfs_bio: Remove a leading space (style)

Introduced in r282085.

Sponsored by: Dell EMC Isilon


# 8660b707 30-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the __bo_vnode field from struct vnode

The pointer can be obtained using __containerof instead.

Reviewed by: kib


# 50048173 11-Aug-2016 Mark Johnston <markj@FreeBSD.org>

Remove b_pin_count from struct buf.

It was added in r153192 for XFS and doesn't appear to have been used for
anything else. XFS was disconnected in r241607 and removed entirely in
r247631.

Reported by: mlaier
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D7468


# 0649caae 14-Jul-2016 Mark Johnston <markj@FreeBSD.org>

Let DDB's buf printer handle NULL pointers in the buf page array.

A buf's b_pages and b_npages fields may be inconsistent after a panic.
For instance, vfs_vmio_invalidate() sets b_npages to zero only after all
pages are unwired and their page array entries are cleared.

MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division


# 31b67320 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes.

Mostly on comments but affects some debug messages.

MFC after: 2 weeks


# f3b827f3 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

bufs: make B_DIRTY and B_PERSISTENT flags available

It appears these flags were related to ext2fs but are completely
unused nowadays. Retire them.

Suggested by: mckusick


# d9c9c81c 21-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: use our roundup2/rounddown2() macros when param.h is available.

rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.

This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.


# ae34b6ff 06-Apr-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080


# 75599120 07-Feb-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

Minor grammar fix in comment.


# d2a28cb0 27-Jan-2016 Kirk McKusick <mckusick@FreeBSD.org>

The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib


# 45130aa1 15-Dec-2015 Adrian Chadd <adrian@FreeBSD.org>

Don't call wakeup if we're just returning reserved space; just
return the reservation and wait for more space to appear.

Submitted by: jeff
Reviewed by: kib


# f1b13a89 08-Dec-2015 Steven Hartland <smh@FreeBSD.org>

Don't use 0 for pointer comparison

Use NULL instead of 0 for comparison with panicstr.

MFC after: 1 week
Sponsored by: Multiplay


# 2eb936ac 06-Nov-2015 Adrian Chadd <adrian@FreeBSD.org>

Add a sched_yield() to work around low memory conditions in the current code.

Things seem to get stuck in low memory conditions where no bufs are available,
the reclamation path is called to wakeup the daemon, but no sleeping is done.
Because of this, we are stuck in a tight loop in the current process and
never run said reclamation path.

This was introduced in r289279 . This is only a temporary workaround
to restore system usefulness until the more permanent solutions can be
found.

Tested:

* Carambola2, 64MB (and 32MB by manual config.)


# a1f26210 30-Oct-2015 Warner Losh <imp@FreeBSD.org>

The error classification from lower layers is a poor indicator of
whether an error is recoverable. Always re-dirty the buffer on errors
from write requests. The invalidation we used to do for errors not EIO
doesn't need to be done for a device that's really gone, since that's
done in a different path.

Reviewed by: mckusick@, kib@


# 156c04f7 29-Oct-2015 Bryan Drewery <bdrewery@FreeBSD.org>

getnewbuf: Initialize bp to avoid uninitialized pointer dereference and brelse().

This came in recently in r289279.

Coverity CID: 1331561


# 21fae961 13-Oct-2015 Jeff Roberson <jeff@FreeBSD.org>

Parallelize the buffer cache and rewrite getnewbuf(). This results in a
8x performance improvement in a micro benchmark on a 4 socket machine.

- Get buffer headers from a per-cpu uma cache that sits in from of the
free queue.
- Use a per-cpu quantum cache in vmem to eliminate contention for kva.
- Use multiple clean queues according to buffer cache size to eliminate
clean queue lock contention.
- Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
from blocking or doing direct recycling.
- Close some bufspace allocation races that could lead to endless
recycling.
- Further the transition to a more modern style of small functions grouped
by prefix in order to improve growing complexity.

Sponsored by: EMC / Isilon
Reviewed by: kib
Tested by: pho


# acada7ae 03-Oct-2015 Alan Cox <alc@FreeBSD.org>

Perform a single batched update to the object's paging-in-progress count
rather than updating it for each page.


# 3138cd36 30-Sep-2015 Mark Johnston <markj@FreeBSD.org>

As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726


# 4615830d 26-Sep-2015 Jeff Roberson <jeff@FreeBSD.org>

- Collapse vfs_vmio_truncate & vfs_vmio_release into a single function.
- Allow vfs_vmio_invalidate() to free the pages, leaving us with a
single loop and bufobj lock when B_NOCACHE/B_INVAL is used.
- Eliminate the special B_ASYNC handling on free that has not been
relevant for some time.
- Remove the extraneous page busy from vfs_vmio_truncate().

Reviewed by: kib
Tested by: pho
Sponsored by: EMC / Isilon storage division


# 589c956a 23-Sep-2015 Jeff Roberson <jeff@FreeBSD.org>

- Fix a nonsense reordering that somehow slipped into my last diff.

Reported by: pho


# 8264830c 22-Sep-2015 Jeff Roberson <jeff@FreeBSD.org>

Some refactoring of the buf/vm interface.
- Eliminate bogus page replacement that is inconsistently applied in the
invalidation loop in brelse. This has been a no-op in modern times as
biodone() is responsible for cleaning up after bogus pages. This
would've spammed the console with printfs at a minimum.
- Allow the compiler and human readers alike to reason about allocbuf()
by splitting it into constituent parts.
- Separate the VM manipulating and buf manipulating code in brelse() and
bufdone() so that the intentions are clear. This makes it evident that
there are several duplicated buf pages loops that will be consolidated
at a later time.

Reviewed by: kib
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 15aaea78 22-Sep-2015 Alan Cox <alc@FreeBSD.org>

Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.

Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.

(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 7297c5e5 10-Sep-2015 Warner Losh <imp@FreeBSD.org>

bufdonebio is now unused. Retire it too.


# ad8d57a9 10-Sep-2015 Warner Losh <imp@FreeBSD.org>

dev_strategy and dev_strategy_csw are unused since r281825. Remove
them.

Differential Revision: https://reviews.freebsd.org/D3620


# c023d823 30-Jul-2015 Roger Pau Monné <royger@FreeBSD.org>

vfs: fill fallout from r286076

This right operator is >= not =>.

Reported by: cem


# 8f89a299 30-Jul-2015 Roger Pau Monné <royger@FreeBSD.org>

vfs: fix off-by-one error in vfs_buf_check_mapped

The check added in r285872 can trigger for valid buffers if the buffer space
used happens to be just after unmapped_buf in KVA space.

Discussed with: kib
Sponsored by: Citrix Systems R&D


# 6cebf7e2 29-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

Move bufshutdown() out of the #ifdef INVARIANTS block.


# 98082691 28-Jul-2015 Jeff Roberson <jeff@FreeBSD.org>

- Make 'struct buf *buf' private to vfs_bio.c. Having a global variable
'buf' is inconvenient and has lead me to some irritating to discover
bugs over the years. It also makes it more challenging to refactor
the buf allocation system.
- Move swbuf and declare it as an extern in vfs_bio.c. This is still
not perfect but better than it was before.
- Eliminate the unused ffs function that relied on knowledge of the buf
array.
- Move the shutdown code that iterates over the buf array into vfs_bio.c.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 38750ada 28-Jul-2015 Jeff Roberson <jeff@FreeBSD.org>

- Eliminate the EMPTYKVA queue. It served as a cache of KVA allocations
attached to bufs to avoid the overhead of the vm. This purposes is now
better served by vmem. Freeing the kva immediately when a buf is
destroyed leads to lower fragmentation and a much simpler scan algorithm.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 6fd04eff 25-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

With the removal of b_saveaddr in the r285819, b_data must be reset to
b_kvabase when the buffer is reclaimed. Otherwise, if b_data for the
mapped buffer was adjusted with the page-offset portion of b_offset,
nothing would re-adjust the b_data, which breaks buffer management
code which expects page-aligned b_data (see e.g. bpman_qenter(), which
skips partial pages).

Fix a minor issue with the GB_KVAALLOC requests, which could result in
returning the mapped buffer if the reused buffer is mapped and have
the right amount of KVA reserved.

Improve assertion in the vfs_buf_check_mapped() to catch unmapped
buffers which have their b_data incorrectly adjusted with offset.

Reported and tested by: pho (previous version)
Reviewed by: jeff (previous version)
Sponsored by: The FreeBSD Foundation


# fade8dd7 23-Jul-2015 Jeff Roberson <jeff@FreeBSD.org>

Refactor unmapped buffer address handling.
- Use pointer assignment rather than a combination of pointers and
flags to switch buffers between unmapped and mapped. This eliminates
multiple flags and generally simplifies the logic.
- Eliminate b_saveaddr since it is only used with pager bufs which have
their b_data re-initialized on each allocation.
- Gather up some convenience routines in the buffer cache for
manipulating buf space and buf malloc space.
- Add an inline, buf_mapped(), to standardize checks around unmapped
buffers.

In collaboration with: mlaier
Reviewed by: kib
Tested by: pho (many small revisions ago)
Sponsored by: EMC / Isilon Storage Division


# 1c1ddc03 22-Jul-2015 Jeff Roberson <jeff@FreeBSD.org>

- Don't defeat the FIFO nature of the buffer cache by eliminating the
most recently used buffer when we are under paging pressure. This is
a perversion of the buffer and page replacement algorithms and recent
improvements to the page daemon have rendered it unnecessary. In the
event that low-memory deadlocks become an issue it would be possible
to make a daemon or event handler that performs a similar action on
the oldest buffers rather than the newest. Since the buf cache
is analogous to the page cache and some minimum working set is desired
another possibility is to simply shrink the minimum working set which
has less downside now that file pages are not directly mapped.

Sponsored by: EMC / Isilon
Reviewed by: alc, kib (with some minor objection)
Tested by: pho


# cf88021a 11-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

Do not allow creation of the dirty buffers for the dead buffer
objects, i.e. for buffer objects which vnode was reclaimed. Buffer
cache cannot write such buffers. Return the error and discard the
buffer immediately on write attempt.

BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks. Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.

Reported and tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# b2c3df84 27-Jun-2015 Konstantin Belousov <kib@FreeBSD.org>

Handle errors from background write of the cylinder group blocks.

First, on the write error, bufdone() call from ffs_backgroundwrite()
panics because pbrelvp() cleared bp->b_bufobj, while brelse() would
try to re-dirty the copy of the cg buffer. Handle this by setting
B_INVAL for the case of BIO_ERROR.

Second, we must re-dirty the real buffer containing the cylinder group
block data when background write failed. Real cg buffer was already
marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is
cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan
may reuse the buffer at any moment. The result is lost write, and if
the write error was only transient, we get corrupted bitmaps.

We cannot re-dirty the original cg buffer in the
ffs_backgroundwritedone(), since the context is not sleepable,
preventing us from sleeping for origbp' lock. Add BV_BKGDERR flag
(protected by the buffer object lock), which is converted into delayed
write by brelse(), bqrelse() and buffer scan.

In collaboration with: Conrad Meyer <cse.cem@gmail.com>
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation (kib),
EMC/Isilon storage division (Conrad)
MFC after: 2 weeks


# b05c401f 23-Jun-2015 Konstantin Belousov <kib@FreeBSD.org>

Only take previous buffer queue lock (olock) when needed for REMFREE
in binsfree().

Submitted by: Conrad Meyer
Sponsored by: EMC / Isilon Storage Division
Review: https://reviews.freebsd.org/D2882
MFC after: 1 week


# 9a2c8535 27-Apr-2015 Konstantin Belousov <kib@FreeBSD.org>

Partially revert r255986: do not call VOP_FSYNC() when helping
bufdaemon in getnewbuf(), do use buf_flush(). The difference is that
bufdaemon uses TRYLOCK to get buffer locks, which allows calls to
getnewbuf() while another buffer is locked.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 7cfdc2a7 24-Apr-2015 Rick Macklem <rmacklem@FreeBSD.org>

MAXBSIZE defines both the largest UFS block size and the
largest size for a buffer in the buffer cache. This patch
defines a new constant MAXBCACHEBUF, which is the largest
size for a buffer in the buffer cache. Having a separate
constant allows MAXBCACHEBUF to be set larger than MAXBSIZE
on a per-architecture basis, so that NFS can do larger read/writes
for these architectures. It modifies sys/param.h so that BKVASIZE
can also be set on a per-architecture basis.
A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE
is fixed as well.

Differential Revision: https://reviews.freebsd.org/D2330
Reviewed by: mav, kib
MFC after: 2 weeks


# 43348dc2 16-Mar-2015 Benno Rice <benno@FreeBSD.org>

Reset bp->bio_done to unmapped_buf when removing a transient map in biodone.

Submitted by: Scott Ferris <scott.ferris@isilon.com>
Sponsored by: EMC / Isilon Storage Division
Reviewed by: kib


# 904ed548 08-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

When getnewbuf_reuse_bp() is called to reclaim some (clean) buffer,
the vnode owning the buffer is not locked. More, it cannot be locked
safely, since getnewbuf_reuse_bp() is called from newbuf(), and some
other vnode is already locked, for which reused buffer will be
reassigned.

As the consequence, reclamation of the owning vnode could go in
parallel, in particular, the call to vnode_destroy_vobject(), which
deallocates the vm object and zeroes the v_bufobj->bo_object. Note
that the pages wired by the buffer are left wired and can be safely
freed by the vfs_vmio_release() without the need for the vm object
lock. Also, seeing stale pointer to the v_object is safe due to vm
object type stability.

Check for bo_bufobj != NULL and cache the value in local variable to
avoid trying to lock NULL vm object.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# ccf8a568 25-Oct-2014 Alexander Motin <mav@FreeBSD.org>

Revert somewhat hackish geom_disk optimization, committed as part of r256880,
and the following r273143 commit, supposed to workaround introduced issue by
quite innocent-looking change.

While there is no clear understanding why, but r273143 is accused in data
corruption in some environments with high I/O load. I personally don't see
any problem in that commit, and possibly it is just a trigger to some other
bug somewhere, but better safe then sorry for now.

Requested by: scottl@
MFC after: 3 days


# 99b9076c 15-Oct-2014 Alexander Motin <mav@FreeBSD.org>

Remove setting BIO_DONE flag for BIOs that have done() method.

This fixes use-after-free, caused by geom_disk, completing same BIO twice
to save extra allocation, and getting BIO_DONE set after the first.

MFC after: 1 week


# 37417245 07-Oct-2014 Jung-uk Kim <jkim@FreeBSD.org>

Make kern.nswbuf tunable from loader.

MFC after: 1 week


# c079e1c0 03-Sep-2014 Benno Rice <benno@FreeBSD.org>

Add KASSERTs to catch the case where a developer may have forgotten to
set bo_bsize on a bufobj.

This is a slight modification of the patch provided.

PR: 193146
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC Isilon Storage Division


# 5f9500c3 04-Aug-2014 Kirk McKusick <mckusick@FreeBSD.org>

Add support for multi-threading of soft updates.

Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.

Reviewed by: kib
Tested by: Peter Holm and Scott Long
MFC after: 2 weeks
Sponsored by: Netflix


# 3ae10f74 16-Jun-2014 Attilio Rao <attilio@FreeBSD.org>

- Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# a19c5d37 09-Jun-2014 Konstantin Belousov <kib@FreeBSD.org>

Devolatile as needed.

Sponsored by: The FreeBSD Foundation
MFC after: 13 days


# 7f82c6c1 08-Jun-2014 Konstantin Belousov <kib@FreeBSD.org>

Change the nblock mutex, protecting the needsbuffer buffer deficit
flags, to rwlock. Lock it in read mode when used from subroutines
called from buffer release code paths.

The needsbuffer is now updated using atomics, while read lock of
nblock prevents loosing the wakeups from bufspacewakeup() and
bufcountadd() in getnewbuf_bufd_help().

In several interesting loads, needsbuffer flags are never set, while
buffers are reused quickly. This causes brelse() and bqrelse() from
different threads to content on the nblock. Now they take nblock in
read mode, together with needsbuffer not needing an update, allowing
higher parallelism.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 23f6698f 08-Jun-2014 Konstantin Belousov <kib@FreeBSD.org>

Initialize the pbuf counter for directio using SYSINIT, instead of
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko
module' sources.

In addition to stopping breaking the layering violation, it also
allows to link kernel when FFS is configured as module and DIRECTIO is
enabled.

One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option. This is similar to
the option QUOTA and ufs_quota.c.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 44afcdab 20-Jan-2014 John Baldwin <jhb@FreeBSD.org>

Fix a typo.


# e136eac2 27-Dec-2013 Konstantin Belousov <kib@FreeBSD.org>

Revert r259200. There are geoms/drivers which do not update
bio_completed, only manage bio_resid, e.g. sa(4).

Reported and tested by: Manfred Antar <null@pozo.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# b4aa4fed 10-Dec-2013 Konstantin Belousov <kib@FreeBSD.org>

Fix detection of EOF in kern_physio(). If bio_length was clipped by
the excess code in g_io_check(), bio_resid is also truncated by
g_io_deliver(). As result, bufdonebio() assigns truncated value to
the buffer b_resid field.

Use the residual bio_completed to calculate buffer b_resid from
b_bcount in bufdonebio(), instead of bio_resid, calculated from
bio_length in g_io_deliver().

The issue is seemingly caused by the code rearrange into g_io_check(),
which is not present in stable/10. The change still looks as the
useful change to have in 10 nevertheless.

Reported by: Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Tested by: pho, Stefan Hegnauer <stefan.hegnauer@gmx.ch>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# ba9593f3 15-Nov-2013 John Baldwin <jhb@FreeBSD.org>

Don't allow vfs.lorunningspace or vfs.hirunningspace to be set such
that lorunningspace is greater than hirunningspace as the system
performs terribly if it is mistuned in this fashion.

MFC after: 1 week


# 8160afda 21-Oct-2013 Alexander Motin <mav@FreeBSD.org>

MFprojects/camlock r256619:
Restore BIO_UNMAPPED and BIO_TRANSIENT_MAPPING in biodonne() when unmapping
temporary mapped buffer. That fixes double unmap if biodone() called twice
for the same BIO (but with different done methods).

Move mapping removal before calling bio_done() method. I believe that it is
very wrong to do anything to BIO after reporting completion. kib@ thinks
it was done for some forgotten now case when bio_done() method needed mapped
buffer. But 1) if BIO was sent as unmapped, then IMO done() should be called
in the same way; 2) IMO there is no guatantee that buffer will be mapped at
this point at all, for example, if all underlying stack supports unmapped
I/O, so bio_done() handler can not expect that.


# 77a30af6 16-Oct-2013 Alexander Motin <mav@FreeBSD.org>

MFprojects/camlock r256370:
- Take BIO lock in biodone() only when there is no completion callback set
and so we should wake up thread waiting in biowait().
- Remove msleep() timeout from biowait(). It was added 11 years ago, when
there was no locks used, and it should not be needed any more.


# acb9d2c7 09-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

The device vnodes are often unlocked when bread() or bwrite() is
called. This probably should be fixed eventually, but for now it is
not needed to try to flush such vnodes from the buffer allocation
context.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)


# 432e79fc 02-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

When helping the bufdaemon from the buffer allocation context, there
is no sense to walk the whole dirty buffer queue. We are only
interested in, and can operate on, the buffers owned by the current
vnode [1]. Instead of calling generic queue flush routine, do
VOP_FSYNC() if possible.

Holding the dirty buffer queue lock in the bufdaemon, without dropping
it, can cause starvation of buffer writes from other threads. This is
esp. easy to reproduce on the big memory machines, where large files
are written, causing almost all dirty buffers accumulating in several
big files, which vnodes are locked by writers. Bufdaemon cannot flush
any buffer, but is iterating over the whole dirty queue
continuously. Since dirty queue mutex is not dropped, bufdone() in
g_up thread is starved, usually deadlocking the machine [2]. Mitigate
this by dropping the queue lock after the vnode is locked, allowing
other queue lock contenders to make a progress.

Discussed with: Jeff [1]
Reported by: pho [2]
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (hrs)


# ac341450 29-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Reimplement r255797 using LK_TRYUPGRADE.

The r255797 was:
Increase the chance of the buffer write from the bufdaemon helper
context to succeed. If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.

PR: kern/178997
Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (glebius)


# 12af71a6 22-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Revert r255797. The LK_UPGRADE | LK_NOWAIT drops the lock.

Approved by: re (marius, implicit)


# d1f8ca48 22-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Increase the chance of the buffer write from the bufdaemon helper
context to succeed. If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.

PR: kern/178997
Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)


# 3aaea6ef 08-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Drain for the xbusy state for two places which potentially do
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (glebius)


# a677b314 04-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

The vm_pageout_flush() functions sbusies pages in the passed pages
run. After that, the pager put method is called, usually translated
to VOP_WRITE(). For the filesystems which use buffer cache,
bufwrite() sbusies the buffer pages again, waiting for the xbusy state
to drain. The later is done in vfs_drain_busy_pages(), which is
called with the buffer pages already sbusied (by vm_pageout_flush()).

Since vfs_drain_busy_pages() can only wait for one page at the time,
and during the wait, the object lock is dropped, previous pages in the
buffer must be protected from other threads busying them. Up to the
moment, it was done by xbusying the pages, that is incompatible with
the sbusy state in the new implementation of busy. Switch to sbusy.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation


# 4f8cf6e5 22-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Both cluster_rbuild() and cluster_wbuild() sometimes set the pages
shared busy without first draining the hard busy state. Previously it
went unnoticed since VPO_BUSY and m->busy fields were distinct, and
vm_page_io_start() did not verified that the passed page has VPO_BUSY
flag cleared, but such page state is wrong. New implementation is
more strict and catched this case.

Drain the busy state as needed, before calling vm_page_sbusy().

Tested by: pho, jkim
Sponsored by: The FreeBSD Foundation


# 5944de8e 22-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 5df87b21 07-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 982d7712 13-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Assert that runningbufspace does not underflow.

Sponsored by: The FreeBSD Foundation


# da4ca6c8 13-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

There is no need to count waiters for the runningbufspace.

Sponsored by: The FreeBSD Foundation


# 92e53673 10-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not invalidate page of the B_NOCACHE buffer or buffer after an I/O
error if any user wired mappings exist. Doing the invalidation
destroys the user wiring.

The change is the temporal measure to close the bug, the more proper
fix is to delegate the invalidation of the page to upper layers
always.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d7b5c50b 07-Jul-2013 Alfred Perlstein <alfred@FreeBSD.org>

Make kassert_printf use __printflike.

Fix associated errors/warnings while I'm here.

Requested by: avg


# 5f518366 27-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
of usenix 2001. The NetBSD version was heavily refactored for bugs
and simplicity.
- Use this resource allocator to allocate the buffer and transient maps.
Buffer cache defrags are reduced by 25% when used by filesystems with
mixed block sizes. Ultimately this may permit dynamic buffer cache
sizing on low KVA machines.

Discussed with: alc, kib, attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# ba39d89b 05-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Consolidate duplicate code into support functions.
- Split the bqlock into bqclean and bqdirty locks.
- Only acquire the wakeup synchronization locks when we cross a
threshold requiring them.
- Restructure the way flushbufqueues() targets work so they are more
smp friendly and sane.

Reviewed by: kib
Discussed with: mckusick, attilio
Sponsored by: EMC / Isilon Storage Division

M vfs_bio.c


# 92fab43f 02-Jun-2013 Konstantin Belousov <kib@FreeBSD.org>

When auto-sizing the buffer cache, limit the amount of physical memory
used as the estimation of size, to 32GB. This provides around 100K of
buffer headers and corresponding KVA for buffer map at the peak.
Sizing the cache larger is not useful, also resulting in the wasting
and exhausting of KVA for large machines.

Reported and tested by: bdrewery
Sponsored by: The FreeBSD Foundation


# 39a4cd0c 02-Jun-2013 Alan Cox <alc@FreeBSD.org>

Reduce the scope of the VM object locking in brelse(). In my tests, this
change reduced the total number of VM object lock acquisitions by brelse()
by 74%.

Sponsored by: EMC / Isilon Storage Division


# 22a72260 30-May-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


# bed927ee 21-May-2013 Attilio Rao <attilio@FreeBSD.org>

vm_object locking is not needed there as pages are already wired.

Sponsored by: EMC / Isilon storage division
Submitted by: alc


# e3ed7ff0 17-May-2013 Attilio Rao <attilio@FreeBSD.org>

Use readlocking now that assertions on vm_page_lookup() are relaxed.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: flo, pho


# d1e99f43 27-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Add dev_strategy_csw() function, which is similar to dev_strategy()
but assumes that a thread reference was already obtained on the passed
device. Use the function from physio(), to avoid two extra dev_mtx
lock and unlock. Note that physio() is always used as the cdevsw
method, or is called from a cdevsw method, and the caller already owns
the reference.

dev_strategy() is left to keep KPI intact, but now it is implemented
as a wrapper around dev_strategy_csw().

Do some style cleanup in physio().

Requested and reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 88c8c0a7 27-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

On i386, double the default size of the bio transient map. With the
maxbcache size fixed, the auto-tuned transient map is too small for
real-world load on i386.

Tested by: David Wolfskill
Sponsored by: The FreeBSD Foundation


# 7db07e1c 21-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Only size and create the bio_transient_map when unmapped buffers are
enabled. Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.

Sponsored by: The FreeBSD Foundation


# e3269b50 20-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

In bufwrite(), a dirty buffer is moved to the clean queue before the
bufobj counter of the writes in progress is incremented. Other thread
inspecting the bufobj would consider it clean.

For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and an other thread doing syncing,
which prevents the situation. On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.

Increment the write ref counter for the buffer object before calling
bundirty().

Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks


# e81ff91e 19-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not remap usermode pages into KVA for physio.

Sponsored by: The FreeBSD Foundation
Tested by: pho


# 7d5365c7 19-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Add a helper function vfs_bio_bzero_buf() to zero the portion of the
buffer, transparently handling mapped or unmapped buffers. Its intent
is to replace the use of bzero(bp->b_data) in cases where the buffer
might be unmapped, to avoid unneeded upgrades.

Sponsored by: The FreeBSD Foundation
Tested by: pho


# ee75e7de 19-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks


# 70e198dd 14-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Some style fixes.

Sponsored by: The FreeBSD Foundation


# c535690b 14-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions. The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by: The FreeBSD Foundation
Tested by: pho


# a1143a3b 14-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Rewrite the vfs_bio_clrbuf(9) to not access the b_data for B_VMIO
buffers directly, use pmap_zero_page_area(9) for each zeroing page
region instead.

Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 20f4e3e1 27-Feb-2013 Konstantin Belousov <kib@FreeBSD.org>

Make recursive getblk() slightly more useful. Keep the buffer state
intact if getblk() is done on the already owned buffer. Exit from
brelse() early when the lock recursion is detected, otherwise brelse()
might prematurely destroy the buffer under some circumstances.

Sponsored by: The FreeBSD Foundation
Noted by: mckusick
Tested by: pho
MFC after: 2 weeks


# 2bc1a1fe 16-Feb-2013 Kirk McKusick <mckusick@FreeBSD.org>

Add barrier write capability to the VFS buffer interface. A barrier
write is a disk write request that tells the disk that the buffer
being written must be committed to the media along with any writes
that preceeded it before any future blocks may be written to the drive.

Barrier writes are provided by adding the functions bbarrierwrite
(bwrite with barrier) and babarrierwrite (bawrite with barrier).

Following a bbarrierwrite the client knows that the requested buffer
is on the media. It does not ensure that buffers written before that
buffer are on the media. It only ensure that buffers written before
that buffer will get to the media before any buffers written after
that buffer. A flush command must be sent to the disk to ensure that
all earlier written buffers are on the media.

Reviewed by: kib
Tested by: Peter Holm


# b1308d72 21-Dec-2012 Attilio Rao <attilio@FreeBSD.org>

Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week


# 5d439a29 09-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not ignore zero address, possibly returned by the vm_map_find()
call. The function indicates a failure by the TRUE return value. To
be extra safe, assert that the return value from the following
vm_map_insert() indicates success.

Fix style issues in the nearby lines, reformulate the comment.

Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 17cb8cfc 09-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove useless comment.

MFC after: 3 days


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# d1b07fd4 02-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix typo [1]. Use commas to separate flag printouts, in style with
other parts of function.

Submitted by: bf [1]
MFC after: 1 week


# 705de7c1 02-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Update the print mask for decoding b_flags. Add print masks for
b_vflags and b_xflags_t and print them as well.

MFC after: 1 week


# 823c83e8 15-May-2012 Grzegorz Bernacki <gber@FreeBSD.org>

Do not call bremfree for managed buffers.

Calling bremfree for these buffers results in panic:
"bremfree: buffer %p not on a queue."

Approved by: kib


# 35338e60 01-Mar-2012 Kirk McKusick <mckusick@FreeBSD.org>

This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks


# e5c92401 26-Feb-2012 Alan Cox <alc@FreeBSD.org>

Fix typo.

MFC after: 1 week


# dc874f98 30-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by: attilio, alc


# 703dec68 27-Oct-2011 Alan Cox <alc@FreeBSD.org>

Eliminate vestiges of page coloring in VM_ALLOC_NOOBJ calls to
vm_page_alloc(). While I'm here, for the sake of consistency, always
specify the allocation class, such as VM_ALLOC_NORMAL, as the first of
the flags.


# fa2b39a1 07-Sep-2011 Attilio Rao <attilio@FreeBSD.org>

Improve the informations reported in case of busy buffers during the shutdown:
- Axe out the SHOW_BUSYBUFS option and uses a tunable for selectively
enable/disable it, which is defaulted for not printing anything (0
value) but can be changed for printing (1 value) and be verbose (2
value)
- Improves the informations outputed: right now, there is no track of
the actual struct buf object or vnode which are referenced by the
shutdown process, but it is printed the related struct bufobj object
which is not really helpful
- Add more verbosity about the state of the struct buf lock and the
vnode informations, with the latter to be activated separately by the
sysctl

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, kib
Approved by: re (ksmith)
MFC after: 10 days


# 2e569926 05-Jul-2011 Marius Strobl <marius@FreeBSD.org>

Call pmap_qremove() before freeing or unwiring the pages, otherwise
there's a window during which a page can be re-used before its previous
mapping is removed.

Reviewed by: alc
MFC after: 1 week


# 6f59b2bd 10-Jun-2011 Jeff Roberson <jeff@FreeBSD.org>

- When printing bufs with show buf the lblkno is often more useful than
the blkno. Print them both.


# 5e863acb 23-May-2011 Ruslan Ermilov <ru@FreeBSD.org>

BKVASIZE was bumped to 16k more than a decade ago.


# 3d08a76b 12-May-2011 Matthew D Fleming <mdf@FreeBSD.org>

Use a name instead of a magic number for kern_yield(9) when the priority
should not change. Fetch the td_user_pri under the thread lock. This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >


# d7b20e4b 11-Feb-2011 Alan Cox <alc@FreeBSD.org>

Retire VFS_BIO_DEBUG. Convert those checks that were still valid into
KASSERT()s and eliminate the rest.

Replace excessive printf()s and a panic() in bufdone_finish() with a
KASSERT() in vm_page_io_finish().

Reviewed by: kib


# e7ceb1e9 07-Feb-2011 Matthew D Fleming <mdf@FreeBSD.org>

Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yeild() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.


# 8189ac85 03-Feb-2011 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary page hold_count checks. These checks predate
r90944, which introduced a general mechanism for handling the freeing
of held pages.

Reviewed by: kib@


# 50cfe7fa 29-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only
consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY
was cleared early in vm_object_page_clean, before the cleaning pass
was done. This is no longer true after r216799.

Moreover, since OBJ_CLEANING is a flag, and not the counter, it could
be reset too prematurely when parallel vm_object_page_clean() are
performed.

Reviewed by: alc (as a part of the bigger patch)
MFC after: 1 month (after r216799 is merged)


# 82de724f 25-Dec-2010 Alan Cox <alc@FreeBSD.org>

Introduce and use a new VM interface for temporarily pinning pages. This
new interface replaces the combined use of vm_fault_quick() and
pmap_extract_and_hold() throughout the kernel.

In collaboration with: kib@


# 8c22654d 17-Dec-2010 Alan Cox <alc@FreeBSD.org>

Implement and use a single optimized function for unholding a set of pages.

Reviewed by: kib@


# 61eee6b8 25-Oct-2010 Ivan Voras <ivoras@FreeBSD.org>

Reduce the difference between hirunningspace and lorunningspace,
it should help interactivity in edge cases.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 3beb1b72 12-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

The buffers b_vflags field is not always properly protected by
bufobj lock. If b_bufobj is not NULL, then bufobj lock should be
held when manipulating the flags. Not doing this sometimes leaves
BV_BKGRDINPROG to be erronously set, causing softdep' getdirtybuf() to
stuck indefinitely in "getbuf" sleep, waiting for background write to
finish which is not actually performed.

Add BO_LOCK() in the cases where it was missed.

In collaboration with: pho
Tested by: bz
Reviewed by: jeff
MFC after: 1 month


# af7326d4 09-Aug-2010 Ivan Voras <ivoras@FreeBSD.org>

Fix (hopefully) the spelling of "queuing."

Submitted by: bf1783 at gmail com


# dd8c13d5 09-Aug-2010 Ivan Voras <ivoras@FreeBSD.org>

Elaborate on how hirunningspace was chosen.


# 3979450b 06-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with: pho
MFC after: 1 month


# 984c6473 22-Jul-2010 Ivan Voras <ivoras@FreeBSD.org>

Make lorunningspace catch up with hirunningspace.
While there, add comment about the magic numbers.

Prodded by: alc


# b089a177 20-Jul-2010 Ivan Voras <ivoras@FreeBSD.org>

Fix expression style.

Prodded by: jhb


# 1de98e06 18-Jul-2010 Ivan Voras <ivoras@FreeBSD.org>

In keeping with the Age-of-the-fruitbat theme, scale up hirunningspace on
machines which can clearly afford the memory.

This is a somewhat conservative version of the patch - more fine tuning may be
necessary.

Idea from: Thread on hackers@
Discussed with: alc


# 28823883 11-Jul-2010 Alan Cox <alc@FreeBSD.org>

Change the implementation of vm_hold_free_pages() so that it performs at
most one call to pmap_qremove(), and thus one TLB shootdown, instead of one
call and TLB shootdown per page.

Simplify the interface to vm_hold_free_pages().

MFC after: 3 weeks


# b99348e5 09-Jul-2010 Alan Cox <alc@FreeBSD.org>

Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently,
the maintenance of vm_pageout_deficit can be localized to just two places:
vm_page_alloc() and vm_pageout_scan().

This change also corrects an off-by-one error in the maintenance of
vm_pageout_deficit. Historically, the buffer cache functions, allocbuf()
and vm_hold_load_pages(), have not taken into account that vm_page_alloc()
already increments vm_pageout_deficit by one.

Reviewed by: kib


# 5f195aa3 05-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Add the ability for the allocflag argument of the vm_page_grab() to
specify the increment of vm_pageout_deficit when sleeping due to page
shortage. Then, in allocbuf(), the code to allocate pages when extending
vmio buffer can be replaced by a call to vm_page_grab().

Suggested and reviewed by: alc
MFC after: 2 weeks


# f4b9ace4 29-Jun-2010 Alan Cox <alc@FreeBSD.org>

Improve bufdone_finish()'s handling of the bogus page. Specifically, if
one or more mappings to the bogus page must be replaced, call pmap_qenter()
just once. Previously, pmap_qenter() was called for each mapping to the
bogus page.

MFC after: 3 weeks


# d19511c3 11-Jun-2010 Matthew D Fleming <mdf@FreeBSD.org>

Add INVARIANTS checking that numfreebufs values are sane. Also add a
per-buf flag to catch if a buf is double-counted in the free count.
This code was useful to debug an instance where a local patch at Isilon
was incorrectly managing numfreebufs for a new buf state.

Reviewed by: jeff
Approved by: zml (mentor)


# 8d4a7be8 08-Jun-2010 Konstantin Belousov <kib@FreeBSD.org>

Reorganize the code in bdwrite() which handles move of dirtiness
from the buffer pages to buffer. Combine the code to set buffer
dirty range (previously in vfs_setdirty()) and to clean the pages
(vfs_clean_pages()) into new function vfs_clean_pages_dirty_buf(). Now
the vm object lock is acquired only once.

Drain the VPO_BUSY bit of the buffer pages before setting valid
and clean bits in vfs_clean_pages_dirty_buf() with new helper
vfs_drain_busy_pages(). pmap_clear_modify() asserts that page is not
busy.

In vfs_busy_pages(), move the wait for draining of VPO_BUSY before
the dirtyness handling, to follow the structure of
vfs_clean_pages_dirty_buf().

Reported and tested by: pho
Suggested and reviewed by: alc
MFC after: 2 weeks


# c8fa8709 02-Jun-2010 Alan Cox <alc@FreeBSD.org>

Minimize the use of the page queues lock for synchronizing access to the
page's dirty field. With the exception of one case, access to this field
is now synchronized by the object lock.


# e98d019d 24-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate the acquisition and release of the page queues lock from
vfs_busy_pages(). It is no longer needed.

Submitted by: kib


# 567e51e1 24-May-2010 Alan Cox <alc@FreeBSD.org>

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


# aa12e8b7 18-May-2010 Alan Cox <alc@FreeBSD.org>

The page queues lock is no longer required by vm_page_set_invalid(), so
eliminate it.

Assert that the object containing the page is locked in
vm_page_test_dirty(). Perform some style clean up while I'm here.

Reviewed by: kib


# 3c4a2440 08-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free(). Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped(). (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.


# e3ef0d2f 04-May-2010 Alan Cox <alc@FreeBSD.org>

Push down the acquisition of the page queues lock into vm_page_unwire().

Update the comment describing which lock should be held on entry to
vm_page_wire().

Reviewed by: kib


# a7283d32 04-May-2010 Alan Cox <alc@FreeBSD.org>

Add page locking to the vm_page_cow* functions.

Push down the acquisition and release of the page queues lock into
vm_page_wire().

Reviewed by: kib


# c5a64851 03-May-2010 Alan Cox <alc@FreeBSD.org>

Acquire the page lock around vm_page_unwire() and vm_page_wire().

Reviewed by: kib


# 139a0de7 02-May-2010 Alan Cox <alc@FreeBSD.org>

Properly synchronize access to the page's hold_count in vfs_vmio_release().

Reviewed by: kib


# b88b6c9d 02-May-2010 Alan Cox <alc@FreeBSD.org>

It makes no sense for vm_page_sleep_if_busy()'s helper, vm_page_sleep(),
to unconditionally set PG_REFERENCED on a page before sleeping. In many
cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by
the page daemon, before the caller to vm_page_sleep() is reawakened.
Instead, we now explicitly set PG_REFERENCED in those cases where having
the page persist until the caller is awakened is clearly desirable. Note,
however, that setting PG_REFERENCED on the page is still only a hint,
and not a guarantee that the page should persist.


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 113db2dd 24-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# e747dbab 17-Apr-2010 Andriy Gapon <avg@FreeBSD.org>

MFC r205860,206097: correctly set b_offset for getblk(devvp)


# 1b4bc5f8 02-Apr-2010 Andriy Gapon <avg@FreeBSD.org>

bo_bsize: revert r205860 and take an alternative approch in getblk

In r205860 I missed the fact that there is code that strongly assumes
that devvp bo_bsize is equal to underlying provider's sectorsize.
In those places it is hard to obtain the sectorsize in an alternative
way if devvp bo_bsize is set to something else.
So, I am reverting bo_bsize assigment in g_vfs_open.
Instead, in getblk I use DEV_BSIZE block size for b_offset calculation
if vp is a disk vp as reported by vn_isdisk. This should coinside with
vp being a devvp.

Reported by: Mykola Dzham <i@levsha.me>
Tested by: Mykola Dzham <i@levsha.me>
Pointyhat to: avg
MFC after: 2 weeks
X-ToDo: convert bread(devvp) in all fs to use bo_bsize-d blocks


# 7ac5806b 19-Jul-2009 Konstantin Belousov <kib@FreeBSD.org>

When buffer write is failed, it is wrong for brelse() to invalidate
portion of the page that was written. Among other problems, this
page might be picked up by pagedaemon, with failed assertion in
vm_pageout_flush() about validity of the page.

Reported and tested by: pho
Approved by: re (kensmith)
MFC after: 3 weeks


# 0a276ede 07-Jun-2009 Alan Cox <alc@FreeBSD.org>

Eliminate an unused variable from allocbuf().

Eliminate the unnecessary setting of page valid bits from a non-VMIO buffer
in vm_hold_load_pages().


# 6864a18c 01-Jun-2009 Alan Cox <alc@FreeBSD.org>

Eliminate a comment describing code that was deleted over eight years ago.
Move another comment to its proper place. Fix a typo in a third comment.


# 1f176894 31-May-2009 Alan Cox <alc@FreeBSD.org>

nfs_write() can use the recently introduced vfs_bio_set_valid() instead of
vfs_bio_set_validclean(), thereby avoiding the page queues lock.

Garbage collect vfs_bio_set_validclean(). Nothing uses it any longer.


# 623469c9 29-May-2009 Alan Cox <alc@FreeBSD.org>

Modify vm_hold_load_pages() to allocate pages using VM_ALLOC_NOOBJ rather
than using the kernel object. This allows the elimination of page queues
locking from vm_hold_free_pages().


# cfeb7489 27-May-2009 Zachary Loafman <zml@FreeBSD.org>

fail(9) support:

Add support for kernel fault injection using KFAIL_POINT_* macros and
fail_point_* infrastructure. Add example fail point in vfs_bio.c to
simulate VM buf pressure.

Approved by: dfr (mentor)


# d422da9a 21-May-2009 John Baldwin <jhb@FreeBSD.org>

Only use the ABI compat shim for vfs.bufspace if the old buffer is smaller
than a long.

PR: amd64/134786
Submitted by: Emil Mikulic emikulic| gmail
MFC after: 3 days


# 1be52693 17-May-2009 Alan Cox <alc@FreeBSD.org>

Several changes to vfs_bio_clrbuf():

Provide a more descriptive comment.

Eliminate dead code. The page cannot possibly have PG_ZERO set.

Eliminate unnecessary blank lines.

Reviewed by: tegge


# 6e5982ca 17-May-2009 Alan Cox <alc@FreeBSD.org>

Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This
eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().

In collaboration with: tegge


# 1c1b26f2 12-May-2009 Alan Cox <alc@FreeBSD.org>

Eliminate page queues locking from bufdone_finish() through the
following changes:

Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect
what this function actually does. Suggested by: tegge

Introduce a new version of vfs_page_set_valid() that does no more than
what the function's name implies. Specifically, it does not update
the page's dirty mask, and thus it does not require the page queues
lock to be held.

Update two of the three callers to the old vfs_page_set_valid() to
call vfs_page_set_validclean() instead because they actually require
the page's dirty mask to be cleared.

Introduce vm_page_set_valid().

Reviewed by: tegge


# c3d3fe63 10-May-2009 Alan Cox <alc@FreeBSD.org>

Revert CVS revision 1.94 (svn r16840). Current pmap implementations don't
suffer from the race condition that motivated revision 1.94. Consequently,
the work-around that was implemented by revision 1.94 is no longer needed.
Moreover, reverting this work-around eliminates the need for
vfs_busy_pages() to acquire the page queues lock when preparing a buffer
for read.

Reviewed by: tegge


# 8aeb69d0 17-Apr-2009 Alexander Kabaev <kan@FreeBSD.org>

Undo private changes that should never have been committed.


# 348496ad 17-Apr-2009 Alexander Kabaev <kan@FreeBSD.org>

More fallout from negative dotdot caching. Negative entries should
be removed from and reinserted to proper ncneg list.

Reported by: pho
Submitted by: kib


# 28a1b4eb 16-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

In flushbufqueues(), do not allocate sentinel buffer on the stack,
struct buf is large. Use sleeping malloc(9) call, and zero the allocated
buf as a debugging feature.


# 949af709 16-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

Export the number of times bufdaemon got help from the normal threads.


# 9b84ba1c 23-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Improve the description of a few sysctls.

Submitted by: bde (partially)
MFC after: 3 days


# 76ed3c71 17-Mar-2009 Attilio Rao <attilio@FreeBSD.org>

Fix an old-standing bug that crept in along the several revisions:
B_DELWRI cleanup and vnode disassociation should happen just before to
assign the buffer to a queue.

Reported by: miwi, Volker <volker at vwsoft dot com>,
Ben Kaduk <minimarmot at gmail dot com>,
Christopher Mallon <christoph dot mallon at gmx dot de>
Tested by: lulf, miwi


# c1d8b5e8 16-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix two issues with bufdaemon, often causing the processes to hang in
the "nbufkv" sleep.

First, ffs background cg group block write requests a new buffer for
the shadow copy. When ffs_bufwrite() is called from the bufdaemon due
to buffers shortage, requesting the buffer deadlock bufdaemon.
Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk
to not block while allocating the buffer, and return failure
instead. Add a flag argument to the geteblk to allow to pass the flags
to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer
allocation failed and either GB_NOWAIT_BD is specified, or geteblk()
is called from bufdaemon (or its helper, see below). In
ffs_bufwrite(), fall back to synchronous cg block write if shadow
block allocation failed.

Since r107847, buffer write assumes that vnode owning the buffer is
locked. The second problem is that buffer cache may accumulate many
buffers belonging to limited number of vnodes. With such workload,
quite often threads that own the mentioned vnodes locks are trying to
read another block from the vnodes, and, due to buffer cache
exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make
any substantial progress because the vnodes are locked.

Allow the threads owning vnode locks to help the bufdaemon by doing
the flush pass over the buffer cache before getnewbuf() is going to
uninterruptible sleep. Move the flushing code from buf_daemon() to new
helper function buf_do_flush(), that is called from getnewbuf(). The
number of buffers flushed by single call to buf_do_flush() from
getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent
recursive calls to buf_do_flush() by marking the bufdaemon and threads
that temporarily help bufdaemon by TDP_BUFNEED flag.

In collaboration with: pho
Reviewed by: tegge (previous version)
Tested by: glebius, yandex ...
MFC after: 3 weeks


# 060e911c 10-Mar-2009 John Baldwin <jhb@FreeBSD.org>

In the ABI shim for vfs.bufspace, rather than truncating values larger than
INT_MAX to INT_MAX, just go ahead and write out the full long to give an
error of ENOMEM to the user process.

Requested by: bde


# 38cce81a 10-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Add an ABI compat shim for the vfs.bufspace sysctl for sysctl requests that
try to fetch it as an int rather than a long. If the current value is
greater than INT_MAX it reports a value of INT_MAX.


# 5bd65606 09-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Adjust some variables (mostly related to the buffer cache) that hold
address space sizes to be longs instead of ints. Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.

Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.

MFC after: 1 month


# 8941aad1 06-Feb-2009 John Baldwin <jhb@FreeBSD.org>

Tweak the output of VOP_PRINT/vn_printf() some.
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
from vn_printf().


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 0d7935fd 10-Oct-2008 Attilio Rao <attilio@FreeBSD.org>

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 52dfc8d7 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the ffs structures introspection functions for ddb.
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.

Reviewed by: tegge
Tested and used by: pho
MFC after: 1 month


# 2bb4c6f9 19-Aug-2008 Konstantin Belousov <kib@FreeBSD.org>

In brelse, put the B_NEEDSGIANT buffer on the QUEUE_DIRTY_GIANT queue,
instead of QUEUE_DIRTY.

Tested by: pho
Reviewed by: attilio
MFC after: 3 days


# 14e69e48 19-Jul-2008 Alan Cox <alc@FreeBSD.org>

Eliminate dead code. (The commit message for revision 1.287 explains why
this code is dead.)


# 71072af5 27-Mar-2008 Attilio Rao <attilio@FreeBSD.org>

b_waiters cannot be adequately protected by the interlock because it is
dropped after the call to lockmgr() so just revert this approach using
something similar to the precedent one:
BUF_LOCKWAITERS() just checks if there are waiters (not the actual number
of them) and it is based on newly introduced lockmgr_waiters() which
returns if the lockmgr has waiters or not. The name has been choosen
differently by old lockwaiters() in order to not confuse them.

KPI results enriched by this commit so __FreeBSD_version bumping and
manpage update will be happening soon.
'struct buf' also changes, so kernel ABI is disturbed.

Bug found by: jeff
Approved by: jeff, kib


# 698b1a66 22-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


# e7ffdf42 20-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Reduce contention on the vnode interlock by not acquiring the BO_LOCK
around the check for the BV_BKGRDINPROG in the brelse() and bqrelse().
See the comment for the explanation why it is safe.

Tested by: pho
Submitted by: jeff


# 0169d126 21-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Reduce contention on the global bdonelock and bpinlock by using
a pool mutex to protect these sleep/wakeup/counter races. This
still is preferable to bloating each bio with a mtx.


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 7fbfba7b 01-Mar-2008 Attilio Rao <attilio@FreeBSD.org>

- Handle buffer lock waiters count directly in the buffer cache instead
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() instead than BUF_LOCKWAITERS() in order to check
for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used

This patch breaks KPI so __FreBSD_version will be bumped and manpages
updated by further commits. Additively, 'struct buf' changes results in
a disturbed ABI also.

[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.

[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>


# 84887fa3 13-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

- Add real assertions to lockmgr locking primitives.
A couple of notes for this:
* WITNESS support, when enabled, is only used for shared locks in order
to avoid problems with the "disowned" locks
* KA_HELD and KA_UNHELD only exists in the lockmgr namespace in order
to assert for a generic thread (not curthread) owning or not the
lock. Really, this kind of check is bogus but it seems very
widespread in the consumers code. So, for the moment, we cater this
untrusted behaviour, until the consumers are not fixed and the
options could be removed (hopefully during 8.0-CURRENT lifecycle)
* Implementing KA_HELD and KA_UNHELD (not surported natively by
WITNESS) made necessary the introduction of LA_MASKASSERT which
specifies the range for default lock assertion flags
* About other aspects, lockmgr_assert() follows exactly what other
locking primitives offer about this operation.

- Build real assertions for buffer cache locks on the top of
lockmgr_assert(). They can be used with the BUF_ASSERT_*(bp)
paradigm.

- Add checks at lock destruction time and use a cookie for verifying
lock integrity at any operation.

- Redefine BUF_LOCKFREE() in order to not use a direct assert but
let it rely on the aforementioned destruction time check.

KPI results evidently broken, so __FreeBSD_version bumping and
manpage update result necessary and will be committed soon.

Side note: lockmgr_assert() will be used soon in order to implement
real assertions in the vnode namespace replacing the legacy and still
bogus "VOP_ISLOCKED()" way.

Tested by: kris (earlier version)
Reviewed by: jhb


# d638e093 19-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

- Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# c94a7cac 29-Dec-2007 Warner Losh <imp@FreeBSD.org>

Rather than not redirting the bp when we get ENXIO, only redirty it
when the error is EIO. This catches a much larger class of errors
that are unlikely to succeed if retried.

Submitted by: bde


# b27aa20e 27-Dec-2007 Warner Losh <imp@FreeBSD.org>

A partial solution to some of the 'pull the umass device with a
mounted FS' problems. These are more along the lines of 'avoiding an
avoidable panic' than a complete solution to removable devices. We
now close the barn door after the horse has gotten lose and has been
hit by a truck, as it were. The barn no longer catches fire in this
case, but the horse is still dead :-).

The vfs_bio.c fix causes us not to put a failed write back into the
dirty pool if the error returned was ENXIO. In that case, the buffer
is treated like any other clean buffer that's being retured. ENXIO
means the device isn't there anymore and will never be there again in
the future, so retrying is futile.

The vfs_mount.c fix treats 'ENXIO' as success for unmounting a file
system. If the device is gone, retrying later won't help and we'll
never be able to unmount the device.

These two are part of a larger patch set submitted by the author. The
other patches will be forth coming. I added comments to these two
patches.

Submitted by: Henrik Gulbrandsen
Reviewed by: phk@
PR: usb/46176 (partial)


# 30418ed3 01-Dec-2007 Alan Cox <alc@FreeBSD.org>

Eliminate vfs_page_set_valid()'s unused argument.


# 3745c395 20-Oct-2007 Julian Elischer <julian@FreeBSD.org>

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


# 718a600b 26-Sep-2007 Ruslan Ermilov <ru@FreeBSD.org>

Fix the description of the formula used to autosize the number of
buffers in the buffer cache.

Approved by: re (kensmith)


# 7bfda801 25-Sep-2007 Alan Cox <alc@FreeBSD.org>

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# 55b5660d 09-Jun-2007 Marcel Moolenaar <marcel@FreeBSD.org>

Work around an integer overflow in expression `3 * maxbufspace / 4',
when maxbufspace is larger than INT_MAX / 3. The overflow causes a
hard hang on ia64 when physical memory is sufficiently large (8GB).


# 7b8c8b85 08-Jun-2007 Xin LI <delphij@FreeBSD.org>

In getblk(), before gbincore(), use BO_LOCK directly when locking
the bufobj, rather than using VI_LOCK, like what was done with
revision 1.453.


# 1c4bcd05 31-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 8e68f804 24-Apr-2007 Konstantin Belousov <kib@FreeBSD.org>

Disable nesting of BOP_BDFLUSH(). VOP_FSYNC() call in bdwrite() could
result in bdwrite() being reentered, thus causing infinite recursion.

Reported and tested by: Peter Holm
Reviewed by: tegge
MFC after: 2 weeks


# 2404c938 29-Mar-2007 Wojciech A. Koszek <wkoszek@FreeBSD.org>

vm_map_delete should be used only internally, by the VM subsystem. Replace
it with vm_map_remove, which not only embeds additional check, but also
takes care of locking.

Reviewed by: alc
Approved by: alc, cognet (mentor)


# 0c9c08dd 25-Mar-2007 Kris Kennaway <kris@FreeBSD.org>

Correct a comment typo


# 486a9414 07-Mar-2007 Julian Elischer <julian@FreeBSD.org>

Instead of doing comparisons using the pcpu area to see if
a thread is an idle thread, just see if it has the IDLETD
flag set. That flag will probably move to the pflags word
as it's permenent and never chenges for the life of the
system so it doesn't need locking.


# 74f094f6 22-Feb-2007 Xin LI <delphij@FreeBSD.org>

Use LIST_EMPTY() instead of unrolled version (LIST_FIRST() [!=]= NULL)


# 2cc7d26f 23-Jan-2007 Konstantin Belousov <kib@FreeBSD.org>

Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.

Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)


# cc570216 20-Dec-2006 Konstantin Belousov <kib@FreeBSD.org>

In rev. 1.514, iodone on async buffer may happen before code checks the
vnode v_flag. For cluster buffers this would result in dereferencing NULL
b_vp. To prevent the panic, cache relevant vnode flag before calling
bstrategy.

Reported by: Peter Holm, kris
Tested by: Peter Holm
Reviewed by: tegge
Pointy hat to: kib


# 3b7b5496 14-Dec-2006 Konstantin Belousov <kib@FreeBSD.org>

Resolve two deadlocks that could be caused by busy md device backed
by vnode. Allow for md thread and the thread that owns lock on vnode
backing the md device to do the write even when runningbufspace is
exhausted.

Tested by: Peter Holm
Reviewed by: tegge
MFC after: 2 weeks


# 0c2b04b4 28-Oct-2006 Alan Cox <alc@FreeBSD.org>

Refactor vfs_setdirty(), creating vfs_setdirty_locked_object().

Call vfs_setdirty_locked_object() from vfs_busy_pages() instead of
vfs_setdirty(), thereby eliminating a second acquisition and release
of the same vm object lock.


# 20ed1b5b 28-Oct-2006 Alan Cox <alc@FreeBSD.org>

In bufdone_finish() restrict the acquisition and release of the page
queues lock to BIO_READ operations. Recent changes to the implementation
of the per-page flags have eliminated the need for the page queues lock
in the other cases.


# 9af80719 21-Oct-2006 Alan Cox <alc@FreeBSD.org>

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


# 04aa807c 01-Oct-2006 Tor Egge <tegge@FreeBSD.org>

If the buffer lock has waiters after the buffer has changed identity then
getnewbuf() needs to drop the buffer in order to wake waiters that might
sleep on the buffer in the context of the old identity.


# 5786be7c 09-Aug-2006 Alan Cox <alc@FreeBSD.org>

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# ab83ac42 08-Aug-2006 Alan Cox <alc@FreeBSD.org>

Reduce the scope of the page queues lock in vfs_busy_pages() now that
vm_page_sleep_if_busy() no longer requires the caller to hold the page
queues lock.


# af51d7bf 21-Jul-2006 Alan Cox <alc@FreeBSD.org>

Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.


# 4b24e421 04-Apr-2006 Jeff Roberson <jeff@FreeBSD.org>

- Properly check against B_DELWRI and B_NEEDSGIANT. This check was
incorrectly written and caused some !NEEDSGIANT buffers to be put in
the NEEDSGIANT queue.

Sponsored by: Isilon Systems, Inc.


# 084d64ac 30-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Add the B_NEEDSGIANT flag which is only set if the vnode that owns a buf
requires Giant. It is set in bgetvp and cleared in brelvp.
- Create QUEUE_DIRTY_GIANT for dirty buffers that require giant.
- In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and
only if we think there are buffers in that queue.

Sponsored by: Isilon Systems, Inc.


# 96c0381f 21-Mar-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Destroy "bip" bio in error case.

Found by: Coverity Prevent analysis tool
Coverity ID: 795
MFC after: 3 days


# c7822632 02-Feb-2006 Tor Egge <tegge@FreeBSD.org>

For low memory situations, non-VMIO buffers didnt't release pages back to
the system when brelse() was called with B_RELBUF set on the buffer. This
could be a problem when the system was low on memory, had many buffers on
QUEUE_EMPTYKVA and started to traverse directories. For each getnewbuf(),
pages were allocated from the system, driving the free reserve downwards.
For each brelse(), the system put the buffer on QUEUE_CLEAN, with B_INVAL
set.

This commit changes the semantics of B_RELBUF to also free pages from
non-VMIO buffers.

Reviewed by: alc


# bb53e2bf 22-Jan-2006 Alan Cox <alc@FreeBSD.org>

Remove an unnecessary call to pmap_remove_all(). The given page is not
mapped because its contents are invalid.

Reviewed by: tegge


# dffaf91a 16-Jan-2006 Tor Egge <tegge@FreeBSD.org>

Set flag in needsbuffer while still holding bqlock to avoid lost wakeup.


# ef39c05b 31-Dec-2005 Alexander Leidinger <netchild@FreeBSD.org>

MI changes:
- provide an interface (macros) to the page coloring part of the VM system,
this allows to try different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotuning of the page coloring values based upon the cache size instead
of options in the kernel config (disabling of the page coloring as a
kernel option is still possible)

MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contains
cache size detection code, every other arch just comes with a dummy
function (this results in the use of default values like it was the
case without the autotuning of the page coloring)
- print some more info on Intel CPU's (like we do on AMD and Transmeta
CPU's)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)


# 6951bea6 06-Dec-2005 Craig Rodrigues <rodrigc@FreeBSD.org>

Changes imported from XFS for FreeBSD project:
- add fields to struct buf (needed by XFS)
- 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3
- b_pin_count, count of pinned buffer

- add new B_MANAGED flag
- add breada() function to initiate asynchronous I/O on read-ahead blocks.
- add bufdone_finish(), bpin(), bunpin_wait() functions

Patches provided by: kan
Reviewed by: phk
Silence on: arch@


# 5bb84bc8 31-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.


# 8272da31 09-Oct-2005 Tor Egge <tegge@FreeBSD.org>

Release clean buffer with wrong size and no dependencies also for non-VMIO
case.


# bd3c2d86 30-Sep-2005 Don Lewis <truckman@FreeBSD.org>

Un-staticize waitrunningbufspace() and call it before returning from
ffs_copyonwrite() if any async writes were launched.

Restore the threads previous TDP_NORUNNINGBUF state before returning
from ffs_copyonwrite().


# 6c8b634f 29-Sep-2005 Don Lewis <truckman@FreeBSD.org>

Un-staticize runningbufwakeup() and staticize updateproc.

Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>


# d41c4674 29-Sep-2005 Peter Edwards <peadar@FreeBSD.org>

Close a race in biodone(), whereby the bio_done field of the passed
bio may have been freed and reassigned by the wakeup before being
tested after releasing the bdonelock.

There's a non-zero chance this is the cause of a few of the crashes
knocking around with biodone() sitting in the stack backtrace.

Reviewed By: phk@


# 9e2aaec1 02-Aug-2005 Jeff Roberson <jeff@FreeBSD.org>

- Use lockmgr_printinfo rather than rolling our own. This introduces a
slight problem by using printf instead of db_printf however
'show lockedvnods' does the same so I believe it is ok for now.


# ec9c9e73 20-Jul-2005 Alan Cox <alc@FreeBSD.org>

Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks


# 7a06fe49 14-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add and enhance asserts related to the wrong bufobj panic.

Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)


# 748c92fb 12-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Split one KASSERT in bremfree() into two to aid in debugging.

Sponsored by: Isilon Systems, Inc.


# cc3149b1 10-Jun-2005 Brian Feldman <green@FreeBSD.org>

Fix a serious deadlock with the NFS client. Given a large enough
atomic write request, it can fill the buffer cache with the entirety
of that write in order to handle retries. However, it never drops
the vnode lock, or else it wouldn't be atomic, so it ends up waiting
indefinitely for more buf memory that cannot be gotten as it has it
all, and it waits in an uncancellable state.

To fix this, hibufspace is exported and scaled to a reasonable
fraction. This is used as the limit of how much of an atomic write
request by the NFS client will be handled asynchronously. If the
request is larger than this, it will be turned into a synchronous
request which won't deadlock the system. It's possible this value is
far off from what is required by some, so it shall be tunable as soon
as mount_nfs(8) learns of the new field.

The slowdown between an asynchronous and a synchronous write on NFS
appears to be on the order of 2x-4x.

General nod by: gad
MFC after: 2 weeks
More testing: wes
PR: kern/79208


# a3d239bc 08-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- My sub-par public school education has been exposed. s/sentinal/sentinel/

Noticed by: Emil Mikulic


# 9e879a5e 08-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Under heavy IO load the buf daemon can run for many hundereds of
milliseconds due to what is essentially n^2 algorithmic complexity. This
change makes the algorithm N*2 instead. This heavy processing manifested
itself as skipping in audio and video playback due to the long scheduling
latencies and contention on giant by pcm.
- flushbufqueues() is now responsible for flushing multiple buffers
rather than one at a time. This allows us to save our progress in the
list by using a sentinal. We must do the numdirtywakeup() and
waitrunningbufspace() here now rather than in buf_daemon().
- Also add a uio_yield() after we have processed the list once for bufs
without deps and again for bufs with deps. This is to release Giant
and allow any other giant locked code to proceed.

Tested by: Many users on current@
Revealed by: schedgraph traces sent by Emil Mikulic & Anthony Ginepro


# 1f22a07a 30-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add bufobj_wrefl() to add a write ref to a bufobj that is already locked.


# 4a723b36 29-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove long dead splbio() calls and comments relating to the old
synchronization mechanism.


# ba4f7c70 30-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't acquire Giant before calling b_biodone, individual consumers are
now required to do so themselves.

Sponsored by: Isilon Systems, Inc.


# 0d12524b 21-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add two KASSERTs to prevent us from recycling a buf that is still on a
bufobj list.

Sponsored by: Isilon Systems, Inc.


# 6c759f35 24-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add information about the buf lock to db_show_buffer.
- Add a 'show lockedbufs' command that is similar to show lockedvnods.

Sponsored by: Isilon Systems, Inc.


# ec346d10 08-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Lock access to the buffer_map with the vm_map lock. In 4.x this was
done with splbio, in 5.x this was done with Giant.

Discussed with: alc
Reported by: julian, pho


# 1ba21282 09-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make various vnode related functions static


# 5c18d18b 09-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add more information to the getnewbuf() recycling KTR.

Sponsored by: Isilon Systems, Inc.


# b56dc9a7 08-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove an invalid KASSERT added in recent background write reshuffling.

Sponsored by: Isilon Systems, Inc.


# dd19a799 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Background writes are entirely an FFS/Softupdates thing.

Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.

Remove all the background write related stuff from the normal bufwrite.

This drags the softdep_move_dependencies() back into FFS.

Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.


# 83644466 04-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't release BKGRDINPROG until after we've bufdone'd the copy.

Sponsored by: Isilon Systems, Inc.


# bd8d684f 28-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't drop the wref on the bufobj until after bufdone() has completed.
Without this, threads waiting in bufobj_wwait() may wakeup prior to
bufdone() completing.

Sponsored by: Isilon Systems, Inc.


# 8516dd18 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# 35764be3 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Kill the VV_OBJBUF and test the v_object for NULL instead.


# 71ddd673 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add CTR calls to trace the lifecycle of a buffer.
- Remove some KASSERTs which are invalid if the appropriate lock is
not held.
- Slightly restructure bremfree() so that it is more sane.
- Change the flush code in bdwrite() to avoid acquiring a mutex
whenever possible.
- Change the flush code in bdwrite() to avoid holding the bufobj mutex
while calling buf_countdeps(). This introduces a lock-order
relationship with the softdep lock that can not otherwise be resolved.
- Don't set B_DONE until bufdone() is complete, otherwise another
processor may believe the buf is done before it is.
- Only acquire Giant if the caller has set b_iodone. Don't grab giant
around normal bufdone() calls.

Sponsored By: Isilon Systems, Inc.


# 6ef8480a 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.


# 8df6bac4 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# b646893f 18-Nov-2004 Jeff Roberson <jeff@FreeBSD.org>

- Eliminate the acquisition and release of the bqlock in bremfree() by
setting the B_REMFREE flag in the buf. This is done to prevent lock order
reversals with code that must call bremfree() with a local lock held.
This also reduces overhead by removing two lock operations per buf for
fsync() and similar.
- Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock
has been acquired so that we may remove ourself from the free-list.
- Provide a bremfreef() function to immediately remove a buf from a
free-list for use only by NFS. This is done because the nfsclient code
overloads the b_freelist queue for its own async. io queue.
- Simplify the numfreebuffers accounting by removing a switch statement
that executed the same code in every possible case.
- getnewbuf() can encounter locked bufs on free-lists once Giant is removed.
Remove a panic associated with this condition and delay asserts that
inspect the buf until after it is locked.

Reviewed by: phk
Sponsored by: Isilon Systems, Inc.


# 6e67e2a7 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Retire b_magic now, we have the bufobj containing the same hint.


# 9f7a3028 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Change buf->b_object to buf->b_bufobj->bo_object

some whitespace fixes.


# 9bc4d9a4 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

whitespace


# c5690651 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove buf->b_dev field.


# d19ef814 03-Nov-2004 Alan Cox <alc@FreeBSD.org>

The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.


# 0cbda9df 29-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the last call in the system to VOP_SPECSTRATEGY(): We can no
longer come through the VNODE layer to the disks since all the filesystems
now go via geom_vfs to GEOM.


# 6afb3b1c 29-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Give dev_strategy() an explict cdev argument in preparation for removing
buf->b-dev.

Put a bio between the buf passed to dev_strategy() and the device driver
strategy routine in order to not clobber fields in the buf.

Assert copyright on vfs_bio.c and update copyright message to canonical
text. There is no legal difference between John Dysons two-clause
abbreviated BSD license and the canonical text.


# c5995e45 28-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Lock bp->b_bufobj->b_object instead of bp->b_object


# 6e77a041 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

The island council met and voted buf_prewrite() home.

Give ffs it's own bufobj->bo_ops vector and create a private strategy
routine, (currently misnamed for forwards compatibility), which is
just a copy of the generic bufstrategy routine except we call
softdep_disk_prewrite() directly instead of through the buf_prewrite()
indirection.

Teach UFS about the need for softdep_disk_prewrite() and call the
function directly in FFS.

Remove buf_prewrite() from the default bufstrategy() and from the
global bio_ops method vector.


# 5d9d81e7 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.


# cd9c0da8 26-Oct-2004 Alan Cox <alc@FreeBSD.org>

Hold the lock on the containing vm object when calling
vm_page_sleep_if_busy().


# ee1d0eb3 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vnode->v_bsize. This was a dead-end.


# a50b7054 25-Oct-2004 Alan Cox <alc@FreeBSD.org>

Use VM_ALLOC_NOBUSY to eliminate vm_page_wakeup() calls and the acquisition
and release of the global page queues lock required to make the call.

Remove GIANT_REQUIRED from vm_hold_free_pages(). All of its VM operations
are properly synchronized.


# 4dcd0ac4 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Collapse vnode->v_object and buf->b_object into bufobj->bo_object.


# b792bebe 24-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


# 494eb176 22-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


# a76d8f4e 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# 6230ce6a 23-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

use dev_re[fl]thread() rather than home rolled versions.


# 1a52a73d 23-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate DEV_STRATEGY() macro: call dev_strategy() directly.

Make dev_strategy() handle errors and departing devices properly.


# a0e78d2e 23-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do not refcount the cdevsw, but rather maintain a cdev->si_threadcount
of the number of threads which are inside whatever is behind the
cdevsw for this particular cdev.

Make the device mutex visible through dev_lock() and dev_unlock().
We may want finer granularity later.

Replace spechash_mtx use with dev_lock()/dev_unlock().


# 08dbd671 15-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused B_WRITEINPROG flag


# 4095f485 15-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

undent some functions a bit.


# ab19cad7 15-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

stylistic polishing.


# 883d3c0c 13-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the buffercache/vnode side of BIO_DELETE processing in
preparation for integration of p4::phk_bufwork. In the future,
local filesystems will talk to GEOM directly and they will consequently
be able to issue BIO_DELETE directly. Since the removal of the fla
driver, BIO_DELETE has effectively been a no-op anyway.


# cf95b5c3 25-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate unused second argument to reassignbuf() and simplify it
accordingly.


# a3d57cfb 25-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Neuter this warning for now, I think I know the remaining issues.


# d8582da6 17-Jul-2004 Alan Cox <alc@FreeBSD.org>

Remove GIANT_REQUIRED from vmapbuf().


# 0f015868 06-Jul-2004 Peter Edwards <peadar@FreeBSD.org>

Fix bug introduced in rev 1.434:

When avoiding the zeroing of "bogus_page" when it appears in a buf,
be sure to advance the pointers into the data for successive pages.

The bug caused file corruption when read(2)ing from a "hole" in a
file where a previous page of the read block had already been faulted
in: fsx tripped up on this pretty quickly. The particular access
pattern is probably pretty unusual, so other applications probably
wouldn't have had problems, but you'd never know.

Reviewed By: alc@


# 7f6599fe 04-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Make the last commit handle non-phk root devices better.


# 5908d366 04-Jul-2004 Stefan Farfeleder <stefanf@FreeBSD.org>

Consistently use __inline instead of __inline__ as the former is an empty macro
in <sys/cdefs.h> for compilers without support for inline.


# 1cbb1e02 03-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnodes surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.


# cfa5e80a 03-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove stale comment


# f3732fd1 17-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# 89c9c53d 16-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# ec1100fc 08-May-2004 Alan Cox <alc@FreeBSD.org>

Avoid pointless zeroing of the bogus page in vfs_bio_clrbuf().

Suggested by: tegge@ (from October of last year)


# 5a324893 05-May-2004 Alan Cox <alc@FreeBSD.org>

Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by: tegge@


# 30a05802 11-Mar-2004 Dag-Erling Smørgrav <des@FreeBSD.org>

Replace a manual check of a VMIO candidate with vn_canvmio(). This
silences an annoying warning in getblk() when VMIO'ing on a directory
vnode, which can happen when vfs.vmiodirenable is 1.

Bring the warning message in line with reality at the same time.

Submitted by: hmp


# ceb58ca5 11-Mar-2004 Poul-Henning Kamp <phk@FreeBSD.org>

When I was a kid my work table was one cluttered mess an cleaning it up
were a rather overwhelming task. I soon learned that if you don't know
where you're going to store something, at least try to pile it next to
something slightly related in the hope that a pattern emerges.

Apply the same principle to the ffs/snapshot/softupdates code which have
leaked into specfs: Add yet a buf-quasi-method and call it from the
only two places I can see it can make a difference and implement the
magic in ffs_softdep.c where it belongs.

It's not pretty, but at least it's one less layer violated.


# 4d453ef1 11-Mar-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


# 3eba15c1 06-Mar-2004 Alan Cox <alc@FreeBSD.org>

Remove GIANT_REQUIRED from vunmapbuf().


# cd690b60 21-Feb-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Device megapatch 6/6:

This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:

Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.

Hold a dev_t reference while the device is open.

KASSERT that we do not enter the driver on a non-referenced dev_t.

Remove old D_NAG code, anonymous dev_t's are not a problem now.

When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.

Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.


# c5aebf38 07-Feb-2004 Alan Cox <alc@FreeBSD.org>

swp_pager_async_iodone() no longer requires Giant. Modify bufdone()
and swapgeom_done() to perform swp_pager_async_iodone() without Giant.

Reviewed by: tegge


# 96a7b422 20-Dec-2003 Alan Cox <alc@FreeBSD.org>

Remove a variable that has been initialized but otherwise unused since
revision 1.315.


# 00cbe31b 15-Nov-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Send B_PHYS out to pasture, it no longer serves any function.


# 28c94164 15-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Remove the remaining now unnecessary checks for the buf's b_object being
NULL. See revision 1.421 for more detail.
- Remove GIANT_REQUIRED from vfs_unbusy_pages(). Discussed with: jeff


# 1415a09d 12-Nov-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Replace B_PHYS conditional assignment to bio_offset with KASSERT check
to see that the originating code already did it right.


# fde81c7d 12-Nov-2003 Kirk McKusick <mckusick@FreeBSD.org>

Update the statfs structure with 64-bit fields to allow
accurate reporting of multi-terabyte filesystem sizes.

You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Tim Robbins <tjr@freebsd.org>
Reviewed by: Julian Elischer <julian@elischer.org>
Reviewed by: the hoards of <arch@freebsd.org>
Sponsored by: DARPA & NAI Labs.


# e35e0182 10-Nov-2003 Alan Cox <alc@FreeBSD.org>

- Revision 1.469 of vfs_subr.c resulted in the buf's b_object field being
consistency initialized. Consequently, a number of conditionals that
checked the validity of b_object before passing it to VM_OBJECT_LOCK()
and VM_OBJECT_UNLOCK() are no longer needed.


# 15a93fcc 03-Nov-2003 Kirk McKusick <mckusick@FreeBSD.org>

Allow the bufdaemon and update daemon processes to skip the
waitrunningbufspace() calls so that they are always able to
proceed and clean up buffer space.

Submitted by: Brian Fundakowski Feldman <green@freebsd.org>


# 787f162d 23-Oct-2003 John Baldwin <jhb@FreeBSD.org>

Move the P_COWINPROGRESS flag from being a per-process p_flag to being a
per-thread td_pflag which doesn't require any locks to read or write as it
is only read or written by curthread on itself.

Glanced at by: mckusick


# 68b00bf6 21-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove KASSERTS on B_PHYS for vmapbuf() and vunmapbuf(), B_PHYS is going
away.


# 48ae2ddd 19-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Add vm object locking to vfs_clean_pages() and vfs_bio_set_validclean().
This is to synchronize access to the vm page's valid field by
vm_page_set_validclean().


# 2d6a9d07 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize b_iooffset before calling VOP_[SPEC]STRATEGY


# 0efedd88 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Don't report b_pblkno, it is going away.


# 583b92e3 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Convert some if(bla) panic("foo") to KASSERTS to improve grep-ability.


# d986d458 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

The size and contents of the DEV_STRATEGY() macro has progressed to
the point where it being a macro is no longer sensible, and it will
only be more so in days to come.

BIO_STRATEGY() is now only used from DEV_STRATEGY() and should not
be used directly anymore.

Put the contents of both in the new function dev_strategy() and
make DEV_STRATEGY() call that function.

In addition, this allows us to make the rather magic bufdonebio()
helper function static.

This alse saves hunderedandsome bytes of code in a typical kernel.


# 85b9831d 13-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a mising vn_finished_write()

Pointy hat: jeff
Found by: robert
Obtained from: kirk


# d58e70a0 12-Oct-2003 Alan Cox <alc@FreeBSD.org>

In vfs_bio_clrbuf(), ignore the state of the object lock if the page is the
"bogus" page.

Found by: tegge


# 08814d66 10-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Synchronize access to a page's valid field in vfs_bio_clrbuf()
by using the lock from its containing object.
- Remove GIANT_REQUIRED from vm_hold_load_pages().


# d1cf0fc7 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a missing vn_start_write() to flushbufqueues(). This could have
caused snapshot related problems.
- The vp can not be NULL here or we would panic in vfs_bio_awrite(). Stop
confusing the logic by checking for it in several places.

Submitted by: kirk and then rototilled by me to remove vp == NULL checks.


# 6ec2fca5 04-Oct-2003 Alan Cox <alc@FreeBSD.org>

Eliminate some unnecessary uses of the vm page queues lock around the
vm page's valid field. This field is being synchronized using the
containing vm object's lock.


# bf0da100 04-Oct-2003 Alan Cox <alc@FreeBSD.org>

- Extend the scope the vm object lock to cover calls to
vm_page_is_valid().
- Assert that the lock on the containing vm object is held in
vm_page_is_valid().


# c76789ca 21-Sep-2003 Alan Cox <alc@FreeBSD.org>

- vm_hold_free_pages() should lock the kernel object. (The pages being
freed belong to the kernel object.)
- Increase the granularity of the vm object locking in vm_hold_load_pages()
in order to reduce the number of times that we acquire and release the
same lock.


# 35b86dc8 14-Sep-2003 Alan Cox <alc@FreeBSD.org>

Correct a typo in the previous revision.


# 58abfe00 12-Sep-2003 Alan Cox <alc@FreeBSD.org>

Convert vmapbuf() from using pmap_extract() to using
pmap_extract_and_hold(). Note, however, that GIANT_REQUIRED should not be
removed until all platforms fully implement the "prot" parameter to
pmap_extract_and_hold().

Reviewed by: tegge


# d919a11d 31-Aug-2003 Jeff Roberson <jeff@FreeBSD.org>

- Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to
bail out if the buffer is not already present.
- The buffer returned by incore() is not locked and should not be sent to
brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the
desired semantics.


# a7db5590 30-Aug-2003 Jeff Roberson <jeff@FreeBSD.org>

- If there is no vp assume that BKGRDINPROG is not set and set RELPBUF in
brelse().


# b5c61abd 30-Aug-2003 Jeff Roberson <jeff@FreeBSD.org>

- In some cases bp->b_vp can be NULL in brelse, don't try to lock the
interlock in that case.

Found by: alc


# 9e8147f3 28-Aug-2003 Marcel Moolenaar <marcel@FreeBSD.org>

In bufdone(), change the format specifier for m->valid and m->dirty to
a long type and explicitly cast m->valid and m->dirty to unsigned long.
When PAGE_SIZE is 32K, these fields are in fact unsigned long.


# 772a9659 28-Aug-2003 Alexander Kabaev <kan@FreeBSD.org>

Do not return with vnode interlock held.

Reviewed by: rwatson


# 9dbfeb0a 28-Aug-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode
interlock.
- Don't use the B_LOCKED flag and QUEUE_LOCKED for background write
buffers. Check for the BKGRDINPROG flag before recycling or throwing
away a buffer. We do this instead because it is not safe for us to move
the original buffer to a new queue from the callback on the background
write buffer.
- Remove the B_LOCKED flag and the locked buffer queue. They are no longer
used.
- The vnode interlock is used around checks for BKGRDINPROG where it may
not be strictly necessary. If we hold the buf lock the a back-ground
write will not be started without our knowledge, one may only be
completed while we're not looking. Rather than remove the code, Document
two of the places where this extra locking is done. A pass should be
done to verify and minimize the locking later.


# b7ad744d 23-Aug-2003 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when performing vm_page_clear_dirty() and
vm_page_set_invalid().


# 4bfd22f2 02-Aug-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Grab Giant in bufdonebio() since drivers may not hold it.

This only protects the "struct buf" consumers (ie: DEV_STRATEGY()),
but does not protect BIO_STRATEGY() users.


# 105660e8 01-Aug-2003 Alan Cox <alc@FreeBSD.org>

Eliminate an abuse of kmem_alloc_pageable() in bufinit()
by using VM_ALLOC_NOOBJ to allocate the bogus page.

Reviewed by: tegge


# 56873368 20-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize b_saveaddr when we hand out buffers


# f717a9d0 11-Jun-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm object when removing a page.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 17a13919 31-May-2003 Poul-Henning Kamp <phk@FreeBSD.org>

The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent
deadlocks with vnode backed md(4) devices because md now uses a
kthread to run the bio requests instead of doing it directly from
the bio down path.


# 01dfc1de 27-Apr-2003 Alan Cox <alc@FreeBSD.org>

Finish the vm_object locking for this file, including holding the vm_object
lock when accessing the vm_object's flags or calling vm_page_lookup().


# af3e0bb2 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_page_alloc() in allocbuf().


# 097d4338 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm_object in vfs_busy_pages().


# 0fa05eae 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_subtract().
- Assert that the vm_object lock is held in vm_object_pip_subtract().


# 0d420ad3 19-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Lock the vm_object when performing vm_object_pip_wakeupn().
- Assert that the vm_object lock is held in vm_object_pip_wakeupn().
- Add a new macro VM_OBJECT_LOCK_ASSERT().


# de5ef101 13-Apr-2003 Alan Cox <alc@FreeBSD.org>

Update locking on the kernel_object to use the new macros.


# 0b556837 05-Apr-2003 Alan Cox <alc@FreeBSD.org>

Remove an unnecessary trunc_page() from vmapbuf().

Reviewed by: tegge


# 08468b6a 03-Apr-2003 Alan Cox <alc@FreeBSD.org>

o Check the b_bufsize passed to vmapbuf() returning an error
if it is invalid.
o Remove a debugging printf() from vmapbuf().

Suggested by: tegge


# d086f85a 30-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Preparation commit before I start on the bioqueue lockdown:

Collect all the bits of bioqueue handing in subr_disk.c, vfs_bio.c is big
enough as it is and disksort already lives in subr_disk.c.


# 5bbb8060 26-Mar-2003 Tor Egge <tegge@FreeBSD.org>

Add support for reading directly from file to userland buffer when the
O_DIRECT descriptor status flag is set and both offset and length is a
multiple of the physical media sector size.


# 227f9a1c 24-Mar-2003 Jake Burkholder <jake@FreeBSD.org>

- Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)


# b4b138c2 18-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a newsted include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


# 749ffa4e 13-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a lock for protecting against msleep(bp, ...) wakeup(bp) races.
- Create a new function bdone() which sets B_DONE and calls wakup(bp). This
is suitable for use as b_iodone for buf consumers who are not going
through the buf cache.
- Create a new function bwait() which waits for the buf to be done at a set
priority and with a specific wmesg.
- Replace several cases where the above functionality was implemented
without locking with the new functions.


# 09f11da5 13-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove a race between fsync like functions and flushbufqueues() by
requiring locked bufs in vfs_bio_awrite(). Previously the buf could
have been written out by fsync before we acquired the buf lock if it
weren't for giant. The cluster_wbuild() handles this race properly but
the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a
parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find
any bufs without deps.


# 7261f5f6 03-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.

Reviwed by: arch
Not objected to by: mckusick


# 491081fa 01-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Hold the vnode interlock across calls to bgetvp instead of acquiring it
internally. This is required to stop multiple bufs from being associated
with a single lblkno.


# bff5362b 28-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- gc USE_BUFHASH. The smp locking of the buf cache renders this useless.


# 7e734c41 25-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

When doing cleanup of excessive buffers in bdwrite (see kern/vfs_bio.c
delta 1.371) we must ensure that we do not get ourselves into a
recursive trap endlessly trying to clean up after ourselves.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.


# 2e3981a7 25-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add the missing NULL interlock argument to a recently added BUF_LOCK.


# 3a7053cb 24-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

Prevent large files from monopolizing the system buffers. Keep
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.

Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.


# 17661e5a 24-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 71146186 16-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Introduce a new function bremfreel() that does a bremfree with the buf
queue lock already held.
- In getblk() and flushbufqueues() use bremfreel() while we still have the
buf queue lock held to keep the lists consistent.
- Add LK_NOWAIT to two cases where we're essentially asserting that the bufs
are not locked while acquiring the locks. This will make sure that we get
the appropriate panic() and not another one for sleeping with a lock held.


# 25c43254 10-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a comment about a race that will happen without Giant.


# c7b716cc 10-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Unlock the nblock after the loop in bwillwrite().


# 7137d635 09-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- In getnewbuf() unlock the bq lock prior to sleeping when we're out of
buffers.

Submitted by: tegge


# 3306adcf 09-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Correct another atomic op.

Spotted by: alc


# 69953c84 08-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move some code out from #ifdef INVARIANTS.


# 767b9a52 09-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Cleanup unlocked accesses to buf flags by introducing a new b_vflag member
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h


# 15553af7 09-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- spell add 'add' and not 'subtract' in an atomic op.

Spotted by: alc
Pointy hat to: jeff


# d85be482 09-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Lock down the buffer cache's infrastructure code. This includes locks on
buf lists, synchronization variables, and atomic ops for the counters.
This change does not remove giant from any code although some pushdown
may be possible.
- In vfs_bio_awrite() don't access buf fields without the buf lock.


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 2d5c7e45 20-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Close the remaining user address mapping races for physical
I/O, CAM, and AIO. Still TODO: streamline useracc() checks.

Reviewed by: alc, tegge
MFC after: 7 days


# 28ec30cd 20-Jan-2003 Alan Cox <alc@FreeBSD.org>

- Hold the page queues lock around vm_page_hold().
- Assert that the page queues lock rather than Giant is held in
vm_page_hold().


# 6eb07b4a 16-Jan-2003 Alan Cox <alc@FreeBSD.org>

Fix two long-standing, but likely harmless, errors in the use of
vm_pageout_deficit:
1. Update vm_pageout_deficit before VM_WAIT. There is no sense in
delaying the update; the sooner the pageout daemon receives this
information the better. Reviewed by: tegge
2. Update vm_pageout_deficit according to the number of pages still
needed to complete the allocation, not the original size of the
allocation. Submitted by: tegge

(These errors have existed since the introduction of vm_pageout_deficit
in revision 1.144.)


# f5979003 15-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Merge all the various copies of vmapbuf() and vunmapbuf() into a single
portable copy. Note that pmap_extract() must be used instead of
pmap_kextract().

This is precursor work to a reorganization of vmapbuf() to close remaining
user/kernel races (which can lead to a panic).


# b0ef8c5f 13-Jan-2003 Alan Cox <alc@FreeBSD.org>

- Update vm_pageout_deficit using atomic operations. It's a simple
counter outside the scope of existing locks.
- Eliminate a redundant clearing of vm_pageout_deficit.


# 8febaa4d 11-Jan-2003 Alan Cox <alc@FreeBSD.org>

vm_hold_load_pages() needn't clear PG_ZERO because it didn't pass
VM_ALLOC_ZERO to vm_page_alloc(). (PG_ZERO is clear by default.)


# 1f179656 07-Jan-2003 Alan Cox <alc@FreeBSD.org>

Make bogus_offset local to bufinit().


# ea480413 05-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Fix cut&paste bug which would result in a panic because buffer was
being biodone'ed multiple times.


# 9ce90443 05-Jan-2003 Alan Cox <alc@FreeBSD.org>

Allocate bogus_page with VM_ALLOC_WIRED. (Previously, bogus_page's
allocation incremented the global count of wired pages, but not the
page's own wire count. This inconsistency was introduced in
revision 1.230.)


# f5b11b6e 04-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Temporarily introduce a new VOP_SPECSTRATEGY operation while I try
to sort out disk-io from file-io in the vm/buffer/filesystem space.

The intent is to sort VOP_STRATEGY calls into those which operate
on "real" vnodes and those which operate on VCHR vnodes. For
the latter kind, the call will be changed to VOP_SPECSTRATEGY,
possibly conditionally for those places where dual-use happens.

Add a default VOP_SPECSTRATEGY method which will call the normal
VOP_STRATEGY. First time it is called it will print debugging
information. This will only happen if a normal vnode is passed
to VOP_SPECSTRATEGY by mistake.

Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY
does on a VCHR vnode today.

Add a new VOP_STRATEGY method in specfs to catch instances where
the conversion to VOP_SPECSTRATEGY has not yet happened. Handle
the request just like we always did, but first time called print
debugging information.

Apart up to two instances of console messages per boot, this amounts
to a glorified no-op commit.

If you get any of the messages on your console I would very much
like a copy of them mailed to phk@freebsd.org


# 7b330b22 04-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Don't call VOP_BMAP on VCHR vnodes when the logical and physical block
numbers are identical: it cannot even hope to accomplish anything.


# 86270230 02-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since
all BUF_STRATEGY did in the first place was call VOP_STRATEGY.


# 9d5abbdd 01-Jan-2003 Jens Schweikhardt <schweikh@FreeBSD.org>

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# d7467893 26-Dec-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when calling vm_page_flag_clear().


# 0cb6c004 23-Dec-2002 Alan Cox <alc@FreeBSD.org>

- Hold the kernel_object's lock around vm_page_alloc(kernel_object,...).
- Hold the page queues lock around vm_page_wakeup().


# 0f5f789c 13-Dec-2002 Kirk McKusick <mckusick@FreeBSD.org>

The buffer daemon cannot skip over buffers owned by locked inodes as
they may be the only viable ones to flush. Thus it will now wait for
an inode lock if the other alternatives will result in rollbacks (and
immediate redirtying of the buffer). If only buffers with rollbacks
are available, one will be flushed, but then the buffer daemon will
wait briefly before proceeding. Failing to wait briefly effectively
deadlocks a uniprocessor since every other process writing to that
filesystem will wait for the buffer daemon to clean up which takes
close enough to forever to feel like a deadlock.

Reported by: Archie Cobbs <archie@dellroad.org>
Sponsored by: DARPA & NAI Labs.
Approved by: re


# 178949e0 23-Nov-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues/flags lock when calling vm_page_set_validclean().

Approved by: re


# 4fec79be 16-Nov-2002 Alan Cox <alc@FreeBSD.org>

Now that pmap_remove_all() is exported by our pmap implementations
use it directly.


# d154fb4f 10-Nov-2002 Alan Cox <alc@FreeBSD.org>

When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().


# bc7bdd50 17-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

When the number of dirty buffers rises too high, the buf_daemon runs
to help clean up. After selecting a potential buffer to write, this
patch has it acquire a lock on the vnode that owns the buffer before
trying to write it. The vnode lock is necessary to avoid a race with
some other process holding the vnode locked and trying to flush its
dirty buffers. In particular, if the vnode in question is a snapshot
file, then the race can lead to a deadlock. To avoid slowing down the
buf_daemon, it does a non-blocking lock request when trying to lock
the vnode. If it fails to get the lock it skips over the buffer and
continues down its queue looking for buffers to flush.

Sponsored by: DARPA & NAI Labs.


# 53cc4793 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused includes.
Clarify the intention of a while();
Move a local variable to avoid potential name-confusion.


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 54286a04 27-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Correctly order VI_UNLOCK(), local variables and block comment.


# 089cf428 26-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make biowait() check bio_error before the BIO_ERROR flag, to propery
catch internal GEOM use of bio_error.

Sponsored by: DARPA & NAI Labs.


# b7227b77 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Lock accesses to v_numoutput.
- Lock calls to gbincore.


# f986355c 15-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

s/Danglish/English/
Some style issues.
Change the timeout to be hz/10 instead of hz.

Brucification by: bde.


# 028e9e59 14-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Un-inline the non-trivial "trivial" bio* functions.
Untangle devstat_end_transaction_bio()


# 06be2aaa 14-Sep-2002 Nate Lawson <njl@FreeBSD.org>

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


# c7143e71 13-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Oops, broke the build there. Uninline biodone() now that it is non-trivial.

Introduce biowait() function. Currently there is a race condition and the
mitigation is a timeout/retry. It is not obvious what kind of locking (if any)
is suitable for BIO_DONE, since the majority of users take are of this
themselves, and only a few places actually rely on the wakeup.

Sponsored by: DARPA & NAI Labs.


# 447b3772 29-Aug-2002 Peter Wemm <peter@FreeBSD.org>

Change hw.physmem and hw.usermem to unsigned long like they used to be
in the original hardwired sysctl implementation.

The buf size calculator still overflows an integer on machines with large
KVA (eg: ia64) where the number of pages does not fit into an int. Use
'long' there.

Change Maxmem and physmem and related variables to 'long', mostly for
completeness. Machines are not likely to overflow 'int' pages in the
near term, but then again, 640K ought to be enough for anybody. This
comes for free on 32 bit machines, so why not?


# 93b0017f 25-Aug-2002 Philippe Charnier <charnier@FreeBSD.org>

Replace various spelling with FALLTHROUGH which is lint()able


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 33278722 03-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Convert two instances of vm_page_sleep_busy() to vm_page_sleep_if_busy()
with appropriate page queue locking.


# 46086ddf 01-Aug-2002 Alan Cox <alc@FreeBSD.org>

o Acquire the page queues lock before calling vm_page_io_finish().
o Assert that the page queues lock is held in vm_page_io_finish().


# 1812190d 30-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Replace vm_page_sleep_busy() with vm_page_sleep_if_busy()
in vfs_busy_pages().


# 4aca0b15 19-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Use vm_page_alloc(... | VM_ALLOC_WIRED) in place of vm_page_wire().


# 7aca6291 19-Jul-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by: DARPA & NAI Labs.


# 91757095 14-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock page queue accesses by vm_page_wire().


# 5123aaef 13-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock some page queue accesses, in particular, those by vm_page_unwire().


# d331c5d4 10-Jul-2002 Matthew Dillon <dillon@FreeBSD.org>

Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
Dirties more cache lines during lookups.

Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).

Advantages
vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.

I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.

The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.

This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc


# f20c8dde 07-Jul-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some printf format errors (one new one reported by gcc and 3 nearby
old ones not reported by gcc). This helps unbreak LINT.


# 49244e35 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Add two asserts that prove & document getblk and geteblk's behavior of
returning locked bufs.


# c031d11b 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix a mistake in my last commit. Don't grab an extra reference to the object
in bp->b_object.


# 9a236af3 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Fixup uses of GETVOBJECT.

- Cache a pointer to the vnode's object in the buf.
- Hold a reference to that object in addition to the vnode's reference just
to be consistent.
- Cleanup code that got the object indirectly through the vp and VOP calls.

This fixes at least one case where we were calling GETVOBJECT without a lock.
It also avoids an expensive layered call at the cost of another pointer in
struct buf.


# 0a84e7c8 23-Jun-2002 Maxime Henrion <mux@FreeBSD.org>

More 64 bits platforms warning fixes.

Reviewed by: rwatson


# ed22d6e9 22-Jun-2002 Matthew Dillon <dillon@FreeBSD.org>

Fix a bug in vfs_bio_clrbuf(). The single-page-clrbuf optimization was
improperly clearing more then just the invalid portions of the page. (This
bug is not known to have been triggered by anything).

Submitted by: tegge
MFC after: 7 days


# 1c85e6a3 21-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 53e645d9 06-Jun-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Use "bwrbg" as description when we sleep for background writing,
"biord" was misleading in every possible way.


# e31c615c 04-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a six year old undocumented #ifdef : NO_B_MALLOC.


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# d1b534df 02-Apr-2002 Matthew Dillon <dillon@FreeBSD.org>

brelse() was improperly clearing B_DELWRI in the B_DELWRI|B_INVAL case
without removing the buffer from the vnode's dirty buffer list, which
can result in a panic in NFS. Replaced the code with a call to bundirty()
which deals with it properly.

PR: kern/36108, kern/36174
Submitted by: various people
Special mention: to Danny Schales <dan@coes.LaTech.edu> for providing a core dump that helped me track this down.
MFC after: 1 day


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 367b50a2 18-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some printf format errors (hopefully all of the remaining daddr64_t
ones for GENERIC, and all others on the same line as those). Reformat
the printfs if necessary to avoid new long lones or old format printf
errors.


# ac59490b 16-Mar-2002 Jake Burkholder <jake@FreeBSD.org>

Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/
pmap_qremove. pmap_kenter is not safe to use in MI code because it is not
guaranteed to flush the mapping from the tlb on all cpus. If the process
in question is preempted and migrates cpus between the call to pmap_kenter
and pmap_kremove, the original cpu will be left with stale mappings in its
tlb. This is currently not a problem for i386 because we do not use PG_G on
SMP, and thus all mappings are flushed from the tlb on context switches, not
just user mappings. This is not the case on all architectures, and if PG_G
is to be used with SMP on i386 it will be a problem. This was committed by
peter earlier as part of his fine grained tlb shootdown work for i386, which
was backed out for other reasons.

Reviewed by: peter


# 0d2af521 15-Mar-2002 Kirk McKusick <mckusick@FreeBSD.org>

Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.


# 628abf6c 15-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Giant pushdown for read/write/pread/pwrite syscalls.

kern/kern_descrip.c:
Aquire Giant in fdrop_locked when file refcount hits zero, this removes
the requirement for the caller to own Giant for the most part.

kern/kern_ktrace.c:
Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls.

kern/vfs_bio.c:
Aquire Giant in bwillwrite if needed.

kern/sys_generic.c
Giant pushdown, remove Giant for:
read, pread, write and pwrite.
readv and writev aren't done yet because of the possible malloc calls
for iov to uio processing.

kern/sys_socket.c
Grab giant in the socket fo_read/write functions.

kern/vfs_vnops.c
Grab giant in the vnode fo_read/write functions.


# f52bd684 05-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

* Move bswlist declaration and initialization from kern/vfs_bio.c to
vm/vm_pager.c, which is the only place it is used.
* Make the QUEUE_* definitions and bufqueues local to vfs_bio.c.
* constify buf_wmesg.


# eb8e6d52 05-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

Document all functions, global and static variables, and sysctls.
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)

Reviewed by: phk (but all errors are mine)


# d1693e17 27-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Back out all the pmap related stuff I've touched over the last few days.
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels. Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely. Userland programs still crashed.


# bd1e3a0f 26-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Jake further reduced IPI shootdowns on sparc64 in loops by using ranged
shootdowns in a couple of key places. Do the same for i386. This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead. This will help for PAE down the road.

Obtained from: jake (MI code, suggestions for MD part)


# 57c10583 22-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.


# 986066d0 22-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Replace bowrite() with BUF_WRITE in ufs.

Remove bowrite(), it is now unused.

This is the first step in getting entirely rid of BIO_ORDERED which is
a generally accepted evil thing.

Approved by: mckusick


# 027df6bd 31-Jan-2002 Matthew Dillon <dillon@FreeBSD.org>

GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer
cache lockups for over a year now.

MFC after: 0 days


# 3ebeaf59 13-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

This fixes a large number of bugs in our NFS client side code. A recent
commit by Kirk also fixed a softupdates bug that could easily be triggered
by server side NFS.

* An edge case with shared R+W mmap()'s and truncate whereby
the system would inappropriately clear the dirty bits on
still-dirty data. (applicable to all filesystems)

THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING.
see vm/vm_page.c line 1641

* The straddle case for VM pages and buffer cache buffers when
truncating. (applicable to NFS client side)

* Possible SMP database corruption due to vm_pager_unmap_page()
not clearing the TLB for the other cpu's. (applicable to NFS
client side but could effect all filesystems). Note: not
considered serious since the corruption occurs beyond the file
EOF.

* When flusing a dirty buffer due to B_CACHE getting cleared,
we were accidently setting B_CACHE again (that is, bwrite() sets
B_CACHE), when we really want it to stay clear after the write
is complete. This resulted in a corrupt buffer. (applicable
to all filesystems but probably only triggered by NFS)

* We have to call vtruncbuf() when ftruncate()ing to remove
any buffer cache buffers. This is still tentitive, I may
be able to remove it due to the second bug fix. (applicable
to NFS client side)

* vnode_pager_setsize() race against nfs_vinvalbuf()... we have
to set n_size before calling nfs_vinvalbuf or the NFS code
may recursively vnode_pager_setsize() to the original value
before the truncate. This is what was causing the user mmap
bus faults in the nfs tester program. (applicable to NFS
client side)

* Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made
by Kirk).

Testing program written by: Avadis Tevanian, Jr.
Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step')
MFC after: 1 week


# a4233d5d 08-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

The nbuf calculation was assuming that PAGE_SIZE = 4096 bytes, which is
bogus. The calculation has been adjusted to use units of kilobytes.

Noticed by: Chad David <davidc@acns.ab.ca>
MFC after: 1 week


# 8ba1f55b 08-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

Placemark an interrupt race in -current which is currently protected by
Giant. -stable will get spl*() fixes for the race.

Reported by: Rob Anderson <rob@isilon.com>
MFC after: 0 days


# 7e76bb56 05-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write. This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use. The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after: 1 week


# 5eb13f76 21-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Documentation

MFC after: 1 day


# bd78cece 11-Oct-2001 John Baldwin <jhb@FreeBSD.org>

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


# 46cad576 26-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

Enable vmiodirenable by default. Remove incorrect comment from sysctl.conf.

MFC after: 1 week


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 0cf5e0eb 22-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Remove the code that limited the buffer_map to 1/2 the size of the
kernel_map. maxbcache takes care of this now and the 1/2 limit can
interfere with testing.

Suggested by: bde


# 219d632c 21-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Move most of the kernel submap initialization code, including the
timeout callwheel and buffer cache, out of the platform specific areas
and into the machine independant area. i386 and alpha adjusted here.
Other cpus can be fixed piecemeal.

Reviewed by: freebsd-smp, jake


# b219758f 27-Jul-2001 Peter Wemm <peter@FreeBSD.org>

Revert previous accidental commit. FWIW, it was part of enabling
VM caching of disks through mmap() and stopping syncing of open files
that had their last reference in the fs removed (ie: their unsync'ed
pages get discarded on close already, so I made it stop syncing too).


# 24a590a0 27-Jul-2001 Peter Wemm <peter@FreeBSD.org>

Fix cut/paste blunder. Serves me right for doing a last minute tweak
to what I had for some time.

Submitted by: bde


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# ac8f990b 24-May-2001 Matthew Dillon <dillon@FreeBSD.org>

This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by: tegge, dillon


# 8aa66068 23-May-2001 John Baldwin <jhb@FreeBSD.org>

- Always call bfreekva() w/o vm_mtx held.
- Always call vfs_setdirty() with vm_mtx held.
- Fix an old comment: vm_hold_unload_pages is called vm_hold_free_pages()
nowadays.
- Always call vm_hold_free_pages() w/o vm_mtx held.


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 793d6d5d 18-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

bread() is a special case of breadn(), so don't replicate code.


# 0dfba3ce 17-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Write a switch statement as less obscure if statements.


# f84e29a0 17-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


# 5819ab3f 16-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

Add debugging option to always read/write cylinder groups as full
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicing
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:

ffs_clusteralloc: map mismatch
ffs_nodealloccg: map corrupted
ffs_nodealloccg: block not in map
ffs_alloccg: map corrupted
ffs_alloccg: block not in map
ffs_alloccgblk: cyl groups corrupted
ffs_alloccgblk: can't find blk in cyl
ffs_checkblk: partially free fragment

The following panics are less likely to be related to this problem,
but might be helped by these debugging options:

ffs_valloc: dup alloc
ffs_blkfree: freeing free block
ffs_blkfree: freeing free frag
ffs_vfree: freeing free inode

If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.


# 63692125 27-Feb-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix lockup for loopback NFS mounts. The pipelined I/O limitations could be
hit on the client side and prevent the server side from retiring writes.
Pipeline operations turned off for all READs (no big loss since reads are
usually synchronous) and for NFS writes, and left on for the default bwrite().
(MFC expected prior to 4.3 freeze)

Testing by: mjacob, dillon


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 4e71e795 03-Feb-2001 Matthew Dillon <dillon@FreeBSD.org>

This commit represents work mainly submitted by Tor and slightly modified
by myself. It solves a serious vm_map corruption problem that can occur
with the buffer cache when block sizes > 64K are used. This code has been
heavily tested in -stable but only tested somewhat on -current. An MFC
will occur in a few days. My additions include the vm_map_simplify_entry()
and minor buffer cache boundry case fix.

Make the buffer cache use a system map for buffer cache KVM rather then a
normal map.

Ensure that VM objects are not allocated for system maps. There were cases
where a buffer map could wind up with a backing VM object -- normally
harmless, but this could also result in the buffer cache blocking in places
where it assumes no blocking will occur, possibly resulting in corrupted
maps.

Fix a minor boundry case in the buffer cache size limit is reached that
could result in non-optimal code.

Add vm_map_simplify_entry() calls to prevent 'creeping proliferation'
of vm_map_entry's in the buffer cache's vm_map. Previously only a simple
linear optimization was made. (The buffer vm_map typically has only a
handful of vm_map_entry's. This stabilizes it at that level permanently).

PR: 20609
Submitted by: (Tor Egge) tegge


# ef73ae4b 09-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other then curproc.


# 2b6b0df7 26-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.


# ffc831da 15-Dec-2000 John Baldwin <jhb@FreeBSD.org>

Stick the kthread API in a kthread_* namespace, and the specialized kproc
functions in a kproc_* namespace.

Reviewed by: -arch


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 1d7e3e42 02-Nov-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Take VBLK devices further out of their missery.

This should fix the panic I introduced in my previous commit on this topic.


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# a18b1f1d 03-Oct-2000 Jason Evans <jasone@FreeBSD.org>

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


# 9ff5ce6b 12-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 54e53ebd 25-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Now that buffer locks can be recursive, we need to delete the panics
that complain about them.

Obtained from: Brian Fundakowski Feldman <green@FreeBSD.org>


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# a2e7a027 16-Jun-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Virtualizes & untangles the bioops operations vector.

Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 017ef345 01-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give struct bio it's own call back mechanism.


# 95bdaa0e 30-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Hmm, diff/patch still doesn't like me.
Missed one s/biowait/bufwait/g


# 87150cb0 29-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

s/biowait/bufwait/g

Prodded by: several.


# c1462ad3 29-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a leftover dysonism.


# 3389ae93 19-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>


# 19583a80 18-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Don't declare common variables in include files:
move buftimelock til vfs_bio.c where it is initialized.


# 8177437d 14-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Complete the bio/buf divorce for all code below devfs::strategy

Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.

CCD not converted yet, casts to struct buf (still safe)

atapi-cd casts to struct buf to examine B_PHYS


# c244d2de 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# 8c125869 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Draw the outline of "struct bio".

Struct bio is the future carrier of I/O requests for "struct buf".


# 7c58e473 27-Mar-2000 Matthew Dillon <dillon@FreeBSD.org>

Commit the buffer cache cleanup patch to 4.x and 5.x. This patch fixes a
fragmentation problem due to geteblk() reserving too much space for the
buffer and imposes a larger granularity (16K) on KVA reservations for
the buffer cache to avoid fragmentation issues. The buffer cache size
calculations have been redone to simplify them (fewer defines, better
comments, less chance of running out of KVA).

The geteblk() fix solves a performance problem that DG was able reproduce.

This patch does not completely fix the KVA fragmentation problems, but
it goes a long way

Mostly Reviewed by: bde and others
Approved by: jkh


# b99c307a 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 21144e3b 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# db5f635a 16-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.


# 71c87cfd 17-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Need to reset the buffer pointer to avoid reconsidering the same buffer
again (without this the rollback analysis was being lost). Should reduce
the write count for most workloads.

Submitted by: Craig A Soules <soules+@andrew.cmu.edu>


# ba4ad1fc 09-Jan-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# cf60e8e4 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.


# 5e950839 07-Jan-2000 Luoqi Chen <luoqi@FreeBSD.org>

Introduce a mechanism to suspend/resume system processes. Suspend syncer
and bufdaemon prior to disk sync during system shutdown.


# 8f95b970 20-Dec-1999 Matthew Dillon <dillon@FreeBSD.org>

Reimplement buf_daemon / getnewbuf() interaction for dealing with
stressful situations. buf_daemon now makes a distinction between
being woken up and its sleep timing out, and as a consequence is now
much better able to dynamically tune itself to its environment.

Reviewed by: Alfred Perlstein <bright@wintelcom.net>


# e9cc4758 30-Nov-1999 Kirk McKusick <mckusick@FreeBSD.org>

Collect read and write counts for filesystems. This new code
drops the counting in bwrite and puts it all in spec_strategy.
I did some tests and verified that the counts collected for writes
in spec_strategy is identical to the counts that we previously
collected in bwrite. We now also get read counts (async reads
come from requests for read-ahead blocks). Note that you need
to compile a new version of mount to get the read counts printed
out. The old mount binary is completely compatible, the only
reason to install a new mount is to get the read counts printed.

Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 38224dcd 22-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 2e3c8fcb 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This is a partial commit of the patch from PR 14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# 3f5bbb08 30-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a #define which doesn't do miracles anymore.


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 9782fb62 23-Oct-1999 Matthew Dillon <dillon@FreeBSD.org>

Adjust the buffer cache to better handle small-memory machines. A
slightly older version of this code was tested by BDE and I.

Also fixes a lockup situation when kva gets too fragmented.

Remove the maxvmiobufspace variable and sysctl, they are no longer
used. Also cleanup (remove) #if 0 sections from prior commits.

This code is more of a hack, but presumably the whole buffer cache
implementation is going to be rewritten in the next year so it's no
big deal.


# d1f088da 11-Oct-1999 Peter Wemm <peter@FreeBSD.org>

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# 4c6fc728 30-Sep-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Count bogus_page as wired.


# d909b563 20-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix bug in brelse() regarding redirtying buffers on B_ERROR. brelse()
improperly ignored the B_INVAL flag when acting on the B_ERROR.
If both B_INVAL and B_ERROR are set the buffer is typically out of the
underlying device's block range and must be destroyed. If only B_ERROR
is set (for a write), a write error occured and operation remains as it
was before: the buffer must be redirtied to avoid corrupting the
filesystem state.

Reviewed by: David Greenman <dg@root.com>
Submitted by: Tor.Egge@fast.no


# d8a31f81 03-Sep-1999 Luoqi Chen <luoqi@FreeBSD.org>

Allow getblk() to be called from an idle context (by panic() inside
an interrupt handler).

Reviewed by: dillon


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# d009ccfa 23-Aug-1999 Bruce Evans <bde@FreeBSD.org>

Cast pointers to uintptr_t instead of casting them to u_long, and/or vice
versa. Cosmetic.


# 0ef1c826 08-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 67452993 26-Jul-1999 Alan Cox <alc@FreeBSD.org>

Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by: dillon


# 29a751bf 09-Jul-1999 Peter Wemm <peter@FreeBSD.org>

bufhashinit() is called with a caddr_t and is expected to return the
same in both the alpha and i386 ports.


# 02503783 08-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

Condition in KASSERT was reversed.


# ad8ac923 08-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations. I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

* buffer cache hash table now dynamically allocated. This will
have no effect on memory consumption for smaller systems and
will help scale the buffer cache for larger systems.

* minor enhancement to pmap_clearbit(). I noticed that
all the calls to it used constant arguments. Making
it an inline allows the constants to propogate to
deeper inlines and should produce better code.

* removal of inherent vfs_ioopt support through the emplacement
of appropriate #ifdef's, with John's permission. If we do not
find a use for it by the end of the year we will remove it entirely.

* removal of getnewbufloops* counters & sysctl's - no longer
necessary for debugging, getnewbuf() is now optimal.

* buffer hash table functions removed from sys/buf.h and localized
to vfs_bio.c

* VFS_BIO_NEED_DIRTYFLUSH flag and support code added
( bwillwrite() ), allowing processes to block when too many dirty
buffers are present in the system.

* removal of a softdep test in bdwrite() that is no longer necessary
now that bdwrite() no longer attempts to flush dirty buffers.

* slight optimization added to bqrelse() - there is no reason
to test for available buffer space on B_DELWRI buffers.

* addition of reverse-scanning code to vfs_bio_awrite().
vfs_bio_awrite() will attempt to locate clusterable areas
in both the forward and reverse direction relative to the
offset of the buffer passed to it. This will probably not
make much of a difference now, but I believe we will start
to rely on it heavily in the future if we decide to shift
some of the burden of the clustering closer to the actual
I/O initiation.

* Removal of the newbufcnt and lastnewbuf counters that Kirk
added. They do not fix any race conditions that haven't already
been fixed by the gbincore() test done after the only call
to getnewbuf(). getnewbuf() is a static, so there is no chance
of it being misused by other modules. ( Unless Kirk can think
of a specific thing that this code fixes. I went through it
very carefully and didn't see anything ).

* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not
think this check is necessary, the buffer should flush properly
whether the vnode is locked or not. ( yes? ).

* removal of extra arguments passed to getnewbuf() that are not
necessary.

* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
vfs_cluster.c

* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
which should greatly aid flushing operations in heavy load
situations - both the pageout and update daemons will be able
to operate more efficiently.

* removal of b_usecount. We may add it back in later but for now
it is useless. Prior implementations of the buffer cache never
had enough buffers for it to be useful, and current implementations
which make more buffers available might not benefit relative to
the amount of sophistication required to implement a b_usecount.
Straight LRU should work just as well, especially when most things
are VMIO backed. I expect that (even though John will not like
this assumption) directories will become VMIO backed some point soon.

Submitted by: Matthew Dillon <dillon@backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# e929c00d 03-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# ddebd879 28-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Hopefully fix the remaining glitches with the BUF_*() changes. This should
(really this time) fix pageout to swap and a couple of clustering cases.

This simplifies BUF_KERNPROC() so that it unconditionally reassigns the
lock owner rather than testing B_ASYNC and having the caller decide when
to do the reassign. At present this is required because some places use
B_CALL/b_iodone to free the buffers without B_ASYNC being set. Also,
vfs_cluster.c explicitly calls BUF_KERNPROC() when attaching the buffers
rather than the parent walking the cluster_head tailq.

Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 72283ee9 28-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Fix a bug that was almost certainly making breadn() fail. BUF_KERNPROC()
was being called on the wrong bp - it should be called on the one that's
just about to be fed to VOP_STRATEGY().


# e6257a9a 26-Jun-1999 Peter Wemm <peter@FreeBSD.org>

GC the remnants of the old pre-softupdates update daemon. It's been
#if 0'd for a fair while now.


# 67812eac 25-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# 45623f31 21-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

When allocating new buffers in getnewbuf, there are several points
at which we may sleep. So, after completing our buffer allocation
we must ensure that another process has not come along and allocated
a different buffer with the same identity. We do this by keeping a
global counter of the number of buffers that getnewbuf has allocated.
We save this count when we enter getnewbuf and check it when we are
about to return. If it has changed, then other buffers were allocated
while we were in getnewbuf, so we must return NULL to let our parent
know that it must recheck to see if it still needs the new buffer.
Hopefully this fix will eliminate the creation of duplicate buffers
with the same identity and the obscure corruptions that they cause.


# f9c8cab5 16-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


# 01cf8ad0 15-Jun-1999 Tor Egge <tegge@FreeBSD.org>

If we still haven't got a sufficient number of free buffers after the
call to flushdirtybuffers() then sleep in waitfreebuffers().
PR: 11697
Reviewed by: David Greenman, Matt Dillon


# e4ab40bc 15-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.


# ccb84588 12-May-1999 Peter Wemm <peter@FreeBSD.org>

Try an fix a couple of dev_t/major/minor etc nits.


# b0eeea20 06-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


# 84c55b38 06-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused fields from struct buf:
b_savekva
b_validoff
b_validend

Reviewed by: dillon, bde


# 4221e284 02-May-1999 Alan Cox <alc@FreeBSD.org>

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 0043b437 29-Apr-1999 Alan Cox <alc@FreeBSD.org>

Address a performance problem in getnewbuf:
In heavy-writing situations, QUEUE_LRU can contain a large number
of DELWRI buffers at its head. These buffers must be moved
to the tail if they cannot be written async in order to reduce
the scanning time required to skip past these buffers in later
getnewbuf() calls.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 35871a15 14-Apr-1999 Dmitrij Tejblum <dt@FreeBSD.org>

getnewbuf(): check return value from tsleep(). Interruptible NFS may pass
PCATCH to slpflag.


# b2e2337b 06-Apr-1999 Alan Cox <alc@FreeBSD.org>

Fix a performance problem with the new getnewbuf() code: in an outofspace
condition ( bufspace > hibufspace ), an inappropriate scan of the empty
queue was performed looking for buffer space to free up.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 8d17e694 05-Apr-1999 Julian Elischer <julian@FreeBSD.org>

Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>


# 96ebc581 19-Mar-1999 Bruce Evans <bde@FreeBSD.org>

Fixed a serious bug in rev.1.202. getnewbuf() sometimes didn't
initialise bp->b_data. This tended to cause panics for file
systems whose block size is smaller than one page.


# 4ef2094e 11-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by: Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)


# 850c9afd 02-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Make comment match code.


# 8b3bd423 02-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Remove inapropriate use of VOP_ISLOCKED()
This produced races resulting in panics and filesystem corruptions
under some circumstances.

Reviewed by: luoqi chen <luoqi@freebsd.org>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
Submitted by: Matt Dillon <dillon@freebsd.org>


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 377f9b28 23-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Don't try to calculate B_CACHE for an NFS related bp that has a
> 0 b_validend. This will screw up small-writes, causing lots of
little writes out the network.

We will assume that NFS handles B_CACHE properly.


# fae1f2e0 22-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix an expression parenthesization typo in a conditional. It should not
have any operational effects other then to make the code in question
a little faster. Also added a more involved comment.


# 33ce4218 22-Jan-1999 David Greenman <dg@FreeBSD.org>

Don't throw away the buffer contents on a fatal write error; just mark
the buffer as still being dirty. This isn't a perfect solution, but
throwing away the buffer contents will often result in filesystem
corruption and this solution will at least correctly deal with transient
errors.
Submitted by: Kirk McKusick <mckusick@mckusick.com>


# 8618c644 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

The main operational changes are in getblk()'s handling of the
B_DELWRI and B_CACHE flags, fixing a bug that showed up with NFS.
Also, a number of cases where manually inserted code has been removed
and replaced with an inline function call giving us better functional
isolation in the source.


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# ba2871b7 19-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Obtained from: Luoqi

Fix NFS file corruption problem introduced in 1.188. The valid range
was not being set properly, causing a later reference to the buffer
to clear the B_CACHE bit.


# a32c99f3 12-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Silence warnings.


# 219cbf59 09-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

KNFize, by bde.


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# e6ee8e16 22-Dec-1998 Matthew Dillon <dillon@FreeBSD.org>

Adjust some comments to prevent future confusion on the implementation.
Also add a reference to the buf(9) manual page.


# 1a551e98 22-Dec-1998 Luoqi Chen <luoqi@FreeBSD.org>

Correctly handle misaligned VMIO buffer (whose start or end offset in the VM
object are not page aligned). This should fix the mount_msdos panic after a
failed attemp to mount as ffs.

Reviewed By: Matthew Dillon <dillon@apollo.backplane.com>
Archie Cobbs <archie@whistle.com>
Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>


# fe523aa1 14-Dec-1998 Matthew Dillon <dillon@FreeBSD.org>

fix intermediate overflow in 'quad = int * int' situation by casting
the arguments to the multiply to a quad equivalent. In this case,
vm_ooffset_t.

Reviewed by: Archie Cobbs <archie@whistle.com>


# 4979978b 07-Dec-1998 Eivind Eklund <eivind@FreeBSD.org>

Fix grouping of statements. This remove a potential panic in the soft
updates code. While I'm here, remove an unintended trigraph.

Reviewed by: Kirk McKusick <kirk@freebsd.org>


# 4f699173 18-Nov-1998 David Greenman <dg@FreeBSD.org>

Closed a very narrow and rare race condition that involved net interrupts,
bio interrupts, and a truncated file that along with the precise alignment
of the planets could result in a page being freed multiple times or a
just-freed page being put onto the inactive queue.


# 40c8cfe5 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.


# 2a78b8d1 30-Oct-1998 David Greenman <dg@FreeBSD.org>

Unwire everything to the inactive queue in order to preserve LRU ordering.


# 0d5a7254 29-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed editing error. Pointed out by bde.


# 73007561 28-Oct-1998 David Greenman <dg@FreeBSD.org>

Added a second argument, "activate" to the vm_page_unwire() call so that
the caller can select either inactive or active queue to put the page on.


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 6cde7a16 13-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.


# ff8fae60 25-Sep-1998 Matthew Dillon <dillon@FreeBSD.org>

PR: kern/7418
Reviewed by: Luoqi Chen <luoqi@watermarkgroup.com>

Fixed problem where write()s can get lost due to buffers flagged B_DELWRI
being improperly released in brelse().


# 10baba4b 25-Sep-1998 Peter Wemm <peter@FreeBSD.org>

Goodbye BOUNCE_BUFFERS, for a hack it has served us well.

The last consumer of this code (the old SCSI system) has left us and
the CAM code does it's own bouncing. The isa dma system has been
doing it's own bouncing for a while too.

Reviewed by: core


# 7ea97031 15-Sep-1998 Justin T. Gibbs <gibbs@FreeBSD.org>

kern_clock.c:
Remove old disk statistics variables.

vfs_bio.c:
Enable bowrite now that B_ORDERED works for all buffer devices.


# 0375c9f2 05-Sep-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform
device drivers about sectors no longer in use.

Device-drivers receive the call through d_strategy, if they have
D_CANFREE in d_flags.

This allows flash based devices to erase the sectors and avoid
pointlessly carrying them around in compactions.

Reviewed by: Kirk Mckusick, bde
Sponsored by: M-Systems (www.m-sys.com)


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# ddae3cb9 28-Aug-1998 Luoqi Chen <luoqi@FreeBSD.org>

Close a race window for getnewbuf() between shared lock holders of the vnode.

Reviewed by: Mike Smith


# 12e14047 25-Aug-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Fix DDBs printing of buf-flags after I changed them yesterday.


# 1d9b3ba1 24-Aug-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the last remaining evidence of B_TAPE.
Reclaim 3 unused bits in b_flags


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# 7032ad10 13-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Protect all modifications to v_numoutput with splbio().


# d474eaaa 06-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Protect all modifications to paging_in_progress with splvm(). The i386
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptable) but other
architectures cannot.

With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)


# 9f14a215 13-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# 6deaf84b 07-Jul-1998 Julian Elischer <julian@FreeBSD.org>

Catch a few corner cases where FreeBSD differs enough from BSD 4.4 to
confuse Soft updates..
Should solve several "dangling deps" panics.


# fd5d1124 04-Jul-1998 Julian Elischer <julian@FreeBSD.org>

VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>


# b1951f40 01-May-1998 Peter Wemm <peter@FreeBSD.org>

vm_page_is_valid() wasn't expecting a large offset argument, it's
expecting a sub-page offset. We were passing the file position,
and vm_page_bits() could do some interesting things when base was
larger PAGE_SIZE.
if (size > PAGE_SIZE - base)
size = PAGE_SIZE - base;
is interesting when (PAGE_SIZE - base) is negative. I could imagine that
this could have interesting consequences for memory page -> device block
bit validation.


# f806d5a2 01-May-1998 Peter Wemm <peter@FreeBSD.org>

Fix one problem with NFSv3 > 2GB file support.

Submitted by: bde


# dc733423 17-Apr-1998 Dag-Erling Smørgrav <des@FreeBSD.org>

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


# c1087c13 15-Apr-1998 Bruce Evans <bde@FreeBSD.org>

Support compiling with `gcc -ansi'.


# f9be8491 26-Mar-1998 John Dyson <dyson@FreeBSD.org>

Correct a problem where buffers might not be zeroed when needed. The
B_MALLOC buffers might not have been properly zeroed.


# 52c64c95 19-Mar-1998 John Dyson <dyson@FreeBSD.org>

In kern_physio.c fix tsleep priority messup.

In vfs_bio.c, remove b_generation count usage,
remove redundant reassignbuf,
remove redundant spl(s),
manage page PG_ZERO flags more correctly,
utilize in invalid value for b_offset until it
is properly initialized. Add asserts
for #ifdef DIAGNOSTIC, when b_offset is
improperly used.
when a process is not performing I/O, and just waiting
on a buffer generally, make the sleep priority
low.
only check page validity in getblk for B_VMIO buffers.

In vfs_cluster, add b_offset asserts, correct pointer calculation
for clustered reads. Improve readability of certain parts of
the code. Remove redundant spl(s).

In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin
<gallatin@cs.duke.edu>). More vtruncbuf problems fixed.


# 4641c8ac 17-Mar-1998 John Dyson <dyson@FreeBSD.org>

Correct a problem where data OR metadata could be thrown away if a
buffer is grown.


# f1aca9c3 17-Mar-1998 KATO Takenori <kato@FreeBSD.org>

Deleted PC-98 code because (1) machine dependent code should not be in
here, and (2) the flag used in PC-98 code has been assigned to another
purpose.


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# b1897c19 08-Mar-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# a638dbdb 03-Mar-1998 John Dyson <dyson@FreeBSD.org>

Fix a rounding error for the NFS buffer validend.
Submitted by: John W. De Boskey <jwd@unx.sas.com>


# ffc82b0a 28-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."


# c78ab18a 11-Feb-1998 David Greenman <dg@FreeBSD.org>

Fix a && that should be an &.
Reviewed by: "John S. Dyson" <dyson@FreeBSD.ORG>
Submitted by: jwd@unx.sas.com (John W. DeBoskey)


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# eaf13dd7 31-Jan-1998 John Dyson <dyson@FreeBSD.org>

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


# 33b90a70 24-Jan-1998 John Dyson <dyson@FreeBSD.org>

Various NFS fixes:
Make vfs_bio buffer mgmt work better.
Buffers were being used after brelse.
Make nfs_getpages work independently of other NFS
interfaces. This eliminates some difficult
recursion problems and decreases pagefault
overhead.
Remove an erroneous vfs_unbusy_pages.
Fix a reentrancy problem, with nfs_vinvalbuf when
vnode is already being rundown.
Reassignbuf wasn't being called when needed under
certain circumstances.

(Thanks to Bill Paul for help.)


# 50ce7ff4 23-Jan-1998 John Dyson <dyson@FreeBSD.org>

Add better support for larger I/O clusters, including larger physical
I/O. The support is not mature yet, and some of the underlying implementation
needs help. However, support does exist for IDE devices now.


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 925a3a41 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 6d94bea4 22-Dec-1997 John Dyson <dyson@FreeBSD.org>

Improve my copyright.


# f2e6e69d 06-Dec-1997 John Dyson <dyson@FreeBSD.org>

Slight performance improvement, removal of unneeded SPLs.


# 1cd52ec3 05-Dec-1997 Bruce Evans <bde@FreeBSD.org>

Don't include <sys/lock.h> in headers when only `struct simplelock' is
required. Fixed everything that depended on the pollution.


# ab3f7469 02-Dec-1997 Poul-Henning Kamp <phk@FreeBSD.org>

In all such uses of struct buf: 's/b_un.b_addr/b_data/g'


# b4b3edc1 01-Dec-1997 John Dyson <dyson@FreeBSD.org>

Fix a serious problem during resizing buffers where old buffers
address space wasn't being properly reclaimed.
Submitted by: Bruce Evans <bde@freebsd.org>


# 4ced7dd5 23-Nov-1997 John Dyson <dyson@FreeBSD.org>

Avoid manipulating the buffer map at interrupt time by deferring bfreekva
to getnewbuf, and remove from brelse.
Reviewed by: dg@root.com


# 4a11ca4e 07-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# cb226aaa 06-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# 55b211e3 28-Oct-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# dba3870c 26-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

VFS interior redecoration.

Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 55166637 11-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# ab36c067 21-Sep-1997 Justin T. Gibbs <gibbs@FreeBSD.org>

init_main.c subr_autoconf.c:
Add support for "interrupt driven configuration hooks".
A component of the kernel can register a hook, most likely
during auto-configuration, and receive a callback once
interrupt services are available. This callback will occur before
the root and dump devices are configured, so the configuration
task can affect the selection of those two devices or complete
any tasks that need to be performed prior to launching init.
System boot is posponed so long as a hook is registered. The
hook owner is responsible for removing the hook once their task
is complete or the system boot can continue.

kern_acct.c kern_clock.c kern_exit.c kern_synch.c kern_time.c:
Change the interface and implementation for the kernel callout
service. The new implemntaion is based on the work of
Adam M. Costello and George Varghese, published in a technical
report entitled "Redesigning the BSD Callout and Timer Facilities".
The interface used in FreeBSD is a little different than the one
outlined in the paper. The new function prototypes are:

struct callout_handle timeout(void (*func)(void *),
void *arg, int ticks);

void untimeout(void (*func)(void *), void *arg,
struct callout_handle handle);

If a client wishes to remove a timeout, it must store the
callout_handle returned by timeout and pass it to untimeout.

The new implementation gives 0(1) insert and removal of callouts
making this interface scale well even for applications that
keep 100s of callouts outstanding.

See the updated timeout.9 man page for more details.


# 804cd17e 20-Sep-1997 John Dyson <dyson@FreeBSD.org>

Re-institute a bugfix in allocation of anonymous buffer memory.


# c1f95f13 10-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

The patch is needed in order to not throw away unmodified
local filesystem metadata at the first brelse call when the
block device vnode has v_tag set to VT_NFS.

Reviewed by: phk
Submitted by: Tor Egge <tegge@idi.ntnu.no>


# 2d85d0df 07-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Some staticized variables were still declared to be extern.


# a5db4bf4 25-Aug-1997 John Dyson <dyson@FreeBSD.org>

Back out some incorrect changes that was worse than the original bug.


# 745b8423 20-Aug-1997 John Dyson <dyson@FreeBSD.org>

Some corrections to the anonymous page managment.
Submitted by: Peter Chen <pmchen@eecs.umich.edu>


# c0ecffb9 09-Aug-1997 John Dyson <dyson@FreeBSD.org>

Modify the scheduling policy to take into account disk I/O waits
as chargeable CPU usage. This should mitigate the problem of processes
doing disk I/O hogging the CPU. Various users have reported the
problem, and test code shows that the problem should now be gone.


# 6b195d32 15-Jun-1997 John Dyson <dyson@FreeBSD.org>

Fix a problem with the VN device. Specifically, the VN device can
cause a problem of spiraling death due to buffer resource limitations.
The vfs_bio code in general had little ability to handle buffer resource
management, and now it does. Also, there are a lot more knobs for tuning the
vfs_bio code now. The knobs came free because of the need that there
always be some immediately available buffers (non-delayed or locked) for
use. Note that the buffer cache code is much less likely to get bogged
down with lots of delayed writes, even more so than before.


# bad324ca 13-Jun-1997 Bruce Evans <bde@FreeBSD.org>

Fixed livelock in getnewbuf().

It is possible for multiple process to sleep concurrently waiting
for a buffer. When the buffer shortage is a shortage of space but
not a shortage of buffer headers, the processes took turns creating
empty buffers and waking each other to advertise the brelse() of
the empties; progress was never made because tsleep() always found
another high-priority process to run and everything was done at
splbio(), so vfs_update never had a chance to flush delayed writes,
not to mention that i/o never had a chance to complete.

The problem seems to be rare in practice, but it can easily be
reproduced by misusing block devices, at least for sufficently slow
devices on machines with a sufficiently small buffer cache. E.g.,
`tar cvf /dev/fd0 /kernel' on an 8MB system with no disk in fd0
causes the problem quickly; the same command with a disk in fd0
causes the problem not quite as quickly; and people have reported
problems newfs'ing file systems on block devices.

Block devices only cause this problem indirectly. They are pessimized
for time and space, and the space pessimization causes the shortage
(it manifests as internal fragmentation in buffer_map).

This should be fixed in 2.2.


# e90b93a1 06-Jun-1997 Doug Rabson <dfr@FreeBSD.org>

Don't throw NFS B_DELWRI buffers back to the vm system in brelse.
Make sure that b_validoff..b_validend is at least as big as
b_dirtyoff..b_dirtyend.


# 501338ca 03-Jun-1997 Doug Rabson <dfr@FreeBSD.org>

Fix some performance problems with the NFS mmap fixes.


# bc3718bb 30-May-1997 Doug Rabson <dfr@FreeBSD.org>

The previous fix didn't work properly for small block size filesystems,
which caused very slow file access for cd9660 and some ext2fs filesystems.

Reviewed by: bde


# 32ad9cb5 19-May-1997 Doug Rabson <dfr@FreeBSD.org>

Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff
and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully
can be tidied up later by a slight redesign.

PR: kern/2573, kern/2754, kern/3046 (possibly)
Reviewed by: dyson


# 8eea4d3c 10-May-1997 Joerg Wunsch <joerg@FreeBSD.org>

Add a DDB command `show buffer', to display a struct buf. It's impossible
to display everything, so i've chosen a small subset. Add more to this as
you think seems useful.


# 95395ca1 12-Apr-1997 John Dyson <dyson@FreeBSD.org>

Improve the buffer cache memory policy by moving pages over to the
cache queue more often. The pageout daemon had to be waken up
more often than necessary since pages were not put on the
cache queue, when they should have been.
Submitted by: David Greenman <dg@freebsd.org>


# 3f39dbc5 01-Apr-1997 Bruce Evans <bde@FreeBSD.org>

Removed potentially harmful garbage <vm/lock.h> and fixed bogus
use of it. It was actually harmless because the use was null due
to fortuitous include orders and identical (wrong) idempotency
macros.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 16a02c11 15-Jan-1997 Bruce Evans <bde@FreeBSD.org>

Removed redundant spl0()'s from kernel processes. They were work-arounds
for a bug in fork().


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 3596818b 04-Dec-1996 John Dyson <dyson@FreeBSD.org>

Clean-up of the new buffer kva allocation code. Also, there was an
error in the !BOUNCE_BUFFERS case.


# 621d520e 01-Dec-1996 John Dyson <dyson@FreeBSD.org>

Fix a problem with the new buffer_map management code. Additionally,
decrease the size of buffer_map to approx 2/3 of what it used to be
(buffer_map can be smaller now.) The original commit of these changes
increased the size of buffer_map to the point where the system would
not boot on large systems -- now large systems with large caches will
have even less problems than before.


# 09e0c6cc 30-Nov-1996 John Dyson <dyson@FreeBSD.org>

Implement a new totally dynamic (up to MAXPHYS) buffer kva allocation
scheme. Additionally, add the capability for checking for unexpected
kernel page faults. The maximum amount of kva space for buffers hasn't
been decreased from where it is, but it will now be possible to do so.

This scheme manages the kva space similar to the buffers themselves. If
there isn't enough kva space because of usage or fragementation, buffers
will be reclaimed until a buffer allocation is successful. This scheme
should be very resistant to fragmentation problems until/if the LFS code
is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS
is not likely to use such a scheme.

Now there should be NO problem allocating buffers up to MAXPHYS.


# 71a57427 27-Nov-1996 John Dyson <dyson@FreeBSD.org>

Potentially fix a problem, whereby MSDOSFS can request buffers
larger than the vfs layer can provide. We now automatically support
32K clusters if MSDOSFS is installed, and panic if a filesystem tries
to allocate a buffer larger than MAXBSIZE.

This commit is a result of some "prodding" by BDE.


# 9970cd37 16-Nov-1996 John Dyson <dyson@FreeBSD.org>

Improve the caching of small files like directories, while not
substantially increasing buffer space. Specifically, we double
the number of buffers, but allocate only half the amount of memory
per buffer. Note that VDIR files aren't cached unless instantiated
in a buffer. This will significantly improve caching.


# 402bcb96 16-Oct-1996 John Dyson <dyson@FreeBSD.org>

Fix a problem that could cause msync (or many other things) to deadlock.
The heuristic for managment of memory backing the buffer cache was
nice, but didn't work due to some architectural problems. Simplify
and improve the algorithm.


# ffe2522e 06-Oct-1996 John Dyson <dyson@FreeBSD.org>

Fix 4 problems:
Major: When blocking occurs in allocbuf() for VMIO files,
excess wire counts could accumulate.
Major: Pages are incorrectly accumulated into the physical
buffer for clustered reads. This happens when bogus
page is needed.
Minor: When reclaiming buffers, the async flag on the buffer
needs to be zero, or the reclaim is not optimal.
Minor: The age flag should be cleared, if a buffer is wanted.


# 08c2c9dd 19-Sep-1996 John Dyson <dyson@FreeBSD.org>

Fix an spl window, a page manipulation at interrupt time that was
incorrect, and correct the support for B_ORDERED. The spl window
fix was from Peter Wemm, and his questions led me to find the problem with
the interrupt time page manipulation.


# f9da2540 18-Sep-1996 John Dyson <dyson@FreeBSD.org>

Add needed spl protection, and some minor cleanups in vfs_vmio_release.
Submitted by: Peter Wemm <peter@spinner.dialix.com> and me.


# 8fdfa820 13-Sep-1996 John Dyson <dyson@FreeBSD.org>

Clean up some more problems with freeing busy or wired pages. The
vfs_bio code was not waiting properly for page state until manipulating
it.


# 9fc1279b 12-Sep-1996 John Dyson <dyson@FreeBSD.org>

A modification that allows the driver strategy to modify the
B_ASYNC flag broke things pretty bad (freeing buffer already on
queue or other wierd buffer queue errors.) The broken code is
left in commented out, but this makes the problem go away for
now.


# 5070c7f8 08-Sep-1996 John Dyson <dyson@FreeBSD.org>

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 0b64164f 05-Sep-1996 Justin T. Gibbs <gibbs@FreeBSD.org>

Add bowrite.

Bowrite guarantees that buffers queued after a call to bowrite will
be written after the specified buffer (on a particular device).
Bowrite does this either by taking advantage of hardware ordering support
(e.g. tagged queueing on SCSI devices) or resorting to a synchronous write.


# 6476c0d2 21-Aug-1996 John Dyson <dyson@FreeBSD.org>

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# d1c4c866 04-Aug-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Add separate kmalloc classes for BIO buffers and Ktrace info.


# 7c818168 29-Jun-1996 David Greenman <dg@FreeBSD.org>

Fixed a major bug that caused various pmap related panics, hangs, and reboots.

The i386 pmap module uses a special area of kernel virtual memory for mapping
of page tables pages when it needs to modify another process's virtual
address space. It's called the 'alternate page table map'. There is only one
of them and it's expected that only one process will be using it at once and
that the operation is atomic.
When the merged VM/buffer cache was implemented over a year ago, it became
necessary to rundown VM pages at I/O completion. The unfortunate and
unforeseen side effect of this is that pmap functions are now called at bio
interrupt time. If there happend to be a process using the alternate page
table map when this I/O completion occurred, it was possible for a different
process's address space to be switched into the alternate page table map -
leaving the current pmap process with the wrong address space mapped when
the interrupt completed. This resulted in BAD things happening like pages
being mapped or removed from the wrong address space, etc.. Since a very
common case of a process modifying another process's address space is during
fork when the kernel stack is inserted, one of the most common manifestations
of this bug was the kernel stack not being mapped properly, resulting in a
silent hang or reboot. This made it VERY difficult to troubleshoot this bug
(I've been trying to figure out the cause of this for >6 months). Fortunately,
the set of conditions that must be true before this problem occurs is
sufficiently rare enough that most people never saw the bug occur. As I/O
rates increase, however, so does the frequency of the crashes. This problem
used to kill wcarchive about every 10 days, but in more recent times when
the traffic exceeded >100GB/day, the machine could barely manage 6 hours of
uptime.
The fix is to make certain that no process has the pages mapped that are
involved in the I/O, before the I/O is started. The pages are made busy, so
no process will be able to map them, either, until the I/O has finished.
This side-steps the issue by still allowing the pmap functions to be called
at interrupt time, but also assuring that the alternate page table map won't
be switched.
Unfortunately, this appears to not be the only cause of this problem. :-(

Reviewed by: dyson


# ad63a118 14-Jun-1996 Satoshi Asami <asami@FreeBSD.org>

The Great PC98 Merge.

All new code is "#ifdef PC98"ed so this should make no difference to
PC/AT (and its clones) users.

Ok'd by: core
Submitted by: FreeBSD(98) development team


# 268e9c53 30-May-1996 John Dyson <dyson@FreeBSD.org>

Keep brelse from freeing busy pages.


# 301051a0 23-May-1996 John Dyson <dyson@FreeBSD.org>

Make sure that we don't place a busy or held page onto the PQ_CACHE queue.


# b18bfc3d 17-May-1996 John Dyson <dyson@FreeBSD.org>

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# aa8de40a 03-May-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Another sweep over the pmap/vm macros, this time with more focus on
the usage. I'm not satisfied with the naming, but now at least there is
less bogus stuff around.


# 18ff6494 08-Mar-1996 John Dyson <dyson@FreeBSD.org>

Correct handling of dirty pages in I/O buffers. The case where pages
residing in a buffer that had been dirtied by a process was being
handled incorrectly. The pages were mistakenly placed into the
cache queue. This would likely have the effect of mmaped page modifications
being lost when I/O system calls were being used simultaneously to
the same locations in a file.
Submitted by: davidg


# c735bcf5 02-Mar-1996 John Dyson <dyson@FreeBSD.org>

Fix the buffer queue problem differently. The previous fix could panic
with a buffer not on queue panic.


# 6538dda3 01-Mar-1996 John Dyson <dyson@FreeBSD.org>

1) Fix a bug that a buffer is removed from a queue, but the
queue type is not set to QUEUE_NONE. This appears to have
caused a hang bug that has been lurking.
2) Fix bugs that brelse'ing locked buffers do not "free" them, but the
code assumes so. This can cause hangs when LFS is used.
3) Use malloced memory for directories when applicable. The amount
of malloced memory is seriously limited, but should decrease the
amount of memory used by an average directory to 1/4 - 1/2 previous.
This capability is fully tunable. (Note that there is no config
parameter, and might never be.)
4) Bias slightly the buffer cache usage towards non-VMIO buffers. Since
the data in VMIO buffers is not lost when the buffer is reclaimed, this
will help performance. This is adjustable also.


# 91477adc 01-Mar-1996 John Dyson <dyson@FreeBSD.org>

Enable VMIO for non-VDIR metadata and block device.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 7548aeb5 06-Jan-1996 David Greenman <dg@FreeBSD.org>

Print out the queue index if it's found to be inconsistent.


# 2199f986 06-Jan-1996 David Greenman <dg@FreeBSD.org>

Rework vm_hold_{load,free}_pages to calculate an index once and use that.
At the same time, be sure to page-truncate bp->b_data so that the result
of the calculation isn't negative.


# 8890984d 05-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert BOUNCE_BUFFERS and BOUNCEPAGES to new option scheme.


# a5782ecc 03-Jan-1996 David Greenman <dg@FreeBSD.org>

Fixed minor struct cred leak. Discovered while looking for the opposite
condition - too many frees, which has yet to be found.

Reviewed by: dyson


# 87b6de2b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# 1cdb6048 12-Dec-1995 John Dyson <dyson@FreeBSD.org>

Fix a problem that was caused by new (partial) support for merged cache
metadata and VBLK type devices. The code is currently mostly disabled,
and a work-around has been added to disabled attempted clustered writes
for VBLK type device buffers. Clustered write of meta-data is currently
a work in progress.


# beb2f78f 11-Dec-1995 John Dyson <dyson@FreeBSD.org>

This should have fixed some conditions that could cause the
"getblk" hang. The B_WANTED flag was being cleared gratuitously,
also the optimization of gbincore for ignoring the B_INVAL flag was
incorrect. There is no place in the code where buffers are on the
hash list that are B_INVAL and not B_BUSY.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# 946bb7a2 04-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A major sweep over the sysctl stuff.

Move a lot of variables home to their own code (In good time before xmas :-)

Introduce the string descrition of format.

Add a couple more functions to poke into these marvels, while I try to
decide what the correct interface should look like.

Next is adding vars on the fly, and sysctl looking at them too.

Removed a tine bit of defunct and #ifdefed notused code in swapgeneric.


# 98d93822 02-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Completed function declarations and/or added prototypes.


# d841aaa7 02-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Finished (?) cleaning up sysinit stuff.


# 5fe17eeb 19-Nov-1995 John Dyson <dyson@FreeBSD.org>

General fixes to the vfs clustring code:

1) Make cluster buffer list be a non-malloced chain. This eliminates
yet another 'evil' M_WAITOK and generally cleans up the code.
2) Fix write clustering for ext2fs. It was just broken. Also, ffs
clustering had an efficiency problem that more bawrites were happening
than should have been.
3) Make changes to buf.h to support the above, plus remove b_pfcent
at the request of David Greenman.
Reviewed by: davidg (partially)


# 0ada0d67 18-Nov-1995 John Dyson <dyson@FreeBSD.org>

Added a missing splx(s).


# aef922f5 05-Nov-1995 John Dyson <dyson@FreeBSD.org>

Greatly simplify the msync code. Eliminate complications in vm_pageout
for msyncing. Remove a bug that manifests itself primarily on NFS
(the dirty range on the buffers is not set on msync.)


# a98ca469 29-Oct-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Second batch of cleanup changes.
This time mostly making a lot of things static and some unused
variables here and there.


# 6d875bf5 19-Oct-1995 John Dyson <dyson@FreeBSD.org>

If we clear the B_CACHE flag because a buffer isn't composed fully of
valid bytes, we must also clear the B_DONE flag. Some filesystems
depend on this (incl NFS) and is probably the cause of the biodone
error and subsequent crash. Anyway this change needs to be made.


# ad7507e2 07-Oct-1995 Steven Wallace <swallace@FreeBSD.org>

Remove prototype definitions from <sys/systm.h>.
Prototypes are located in <sys/sysproto.h>.

Add appropriate #include <sys/sysproto.h> to files that needed
protos from systm.h.

Add structure definitions to appropriate files that relied on sys/systm.h,
right before system call definition, as in the rest of the kernel source.

In kern_prot.c, instead of using the dummy structure "args", create
individual dummy structures named <syscall>_args. This makes
life easier for prototype generation.


# 7329854a 30-Sep-1995 David Greenman <dg@FreeBSD.org>

Two critical bugfixes:

1) "obj" was't initialized properly, resulting in an important vm_page_lookup
always failing (resulting in a panic).
2) busy pages could be put on the cache queue or freed (resulting in a panic).


# 164fd96f 23-Sep-1995 John Dyson <dyson@FreeBSD.org>

These changes fix a bug in the clustering code that I made worse when adding
support for EXT2FS. Note that the Sig-11 problems appear to be caused by
this, but there is still probably an underlying VM problem that let this
clustering bug cause vnode objects to appear to be corrupted.

The direct manifestation of this bug would have been severely mis-read
files. It is possible that processes would Sig-11 on very damaged
input files and might explain the mysterious differences in system
behaviour when phk's malloc is being used.


# 4590fd3a 09-Sep-1995 David Greenman <dg@FreeBSD.org>

Fixed init functions argument type - caddr_t -> void *. Fixed a couple of
compiler warnings.


# c83ebe77 03-Sep-1995 John Dyson <dyson@FreeBSD.org>

Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count
for VOP_BMAP. Updated affected filesystems...


# 8c601f7d 03-Sep-1995 John Dyson <dyson@FreeBSD.org>

Improvements to the cluster code, minor vfs_bio efficiency:
Better performance -- more aggressive read-ahead
under certain circumstanses.

Mods to support clustering on small
( < PAGE_SIZE) block size filesystems (e.g. ext2fs,
msdosfs.)


# 2b14f991 28-Aug-1995 Julian Elischer <julian@FreeBSD.org>

Reviewed by: julian with quick glances by bruce and others
Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry.
they are:
New low-level init code that supports loadbal modules better
some cleanups in the namei code to help terry in 16-bit character support
some changes to the mount-root code to make it a little more
modular..

NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able
to test those cases..

certainly mounting root of disk still works just fine..
mfs should work but is untested. (tomorrows task)

The low level init stuff includes a total rewrite of init_main.c
to make it possible for new modules to have an init phase by simply
adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can
be added to the kernel without editing any other files other than the
'files' file.


# 9f95e53c 24-Aug-1995 David Greenman <dg@FreeBSD.org>

Another minor optimization, this time to incore().


# ff3aaf25 24-Aug-1995 David Greenman <dg@FreeBSD.org>

Minor optimization.


# d63abe41 05-Aug-1995 David Greenman <dg@FreeBSD.org>

Resize both VMIO and non-VMIO buffers if the size changes.


# 28f8db14 29-Jul-1995 Bruce Evans <bde@FreeBSD.org>

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuation.


# 23953f95 24-Jul-1995 David Greenman <dg@FreeBSD.org>

Killed bogus casts in tsleep/wakeup calls.


# cd901555 24-Jul-1995 David Greenman <dg@FreeBSD.org>

Fixed broken offset use in vfs_unbusy_pages() which resulted in several
different types of panics/inconsistencies with NFS clients.
Cleared PG_WANTED where appropriate.
Added checks for buffer busy in allocbuf and biodone.

Reviewed by: John Dyson


# 23f76268 23-Jul-1995 David Greenman <dg@FreeBSD.org>

Panic if no object in biodone. Slightly optimized allocbuf() again.


# be49bd16 23-Jul-1995 David Greenman <dg@FreeBSD.org>

Added some additional diagnostic information output when panicing in
biodone().


# 31de6175 23-Jul-1995 David Greenman <dg@FreeBSD.org>

Fixed two cases where some parans were missing, resulting in some bogus
logic. Slightly simplified allocbuf().


# 44918dfe 20-Jul-1995 David Greenman <dg@FreeBSD.org>

Re-lookup the buffer if the vnode isn't locked. The previous check for
VBLK vnodes isn't adequate since all NFS nodes aren't locked, either. The
result is a race condition that would lead to duplicate buffers at the
same block offset.

Submitted by: John Dyson


# 1ce781c3 17-Jul-1995 David Greenman <dg@FreeBSD.org>

Fixed "bufspace" calculation. It was lossy in some circumstances of the
buffer resizing and caused a "newbuf" deadlock.

Reviewed by: John Dyson & David Greenman
Submitted by: Peter Wemm


# 8393c48a 15-Jul-1995 David Greenman <dg@FreeBSD.org>

Resize buffers if they aren't the correct size. Several months ago we
made a change to NFS that caused buffers at EOF to be variable size. This
had the undesired side-effect of breaking delayed writes on NFS. This
fixes it.

Submitted by: John Dyson


# aa2cabb9 27-Jun-1995 David Greenman <dg@FreeBSD.org>

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 61f5d510 21-May-1995 David Greenman <dg@FreeBSD.org>

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


# b2b795f0 11-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Fix -Wformat warnings from LINT kernel.


# fc3d49a9 29-Apr-1995 David Greenman <dg@FreeBSD.org>

Check for curproc != NULL before dereferencing it.


# 5baf11ce 15-Apr-1995 David Greenman <dg@FreeBSD.org>

Removed unused & empty bufstats() function.


# cece489d 16-Apr-1995 David Greenman <dg@FreeBSD.org>

Killed gratuitous b_vp=NULL in bufinit. The entire buffer is already
bzero()'d.


# d94a4c04 15-Apr-1995 David Greenman <dg@FreeBSD.org>

1) Check for curproc != NULL in bread/bwrite. John convinced me that this
is necessary in order for panic+sync to work. Will also gloss over a panic
that Jordan was having with the install floppies that remains unexplainable.
2) Handle "bogus_page" a little better.
3) Set page protection to VM_PROT_NONE if the entire page has become !valid.

Submitted by: John Dyson (2&3), me (1).


# 213fd1b6 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
call vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 6e14d964 26-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed some redundant 'vmio' checks.


# f57459b6 26-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed third arg (vmio) to allocbuf() that was added with the original
merged cache changes, and figure it out based on the B_VMIO buffer flag.
Fixes a problem where delayed write VMIO buffers would sometimes get
recopied into kernel-alloced memory.

Submitted by: John Dyson


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 5dcf3090 07-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed most of the special policy regarding the seperation of VMIO and
dir/metadata buffers as it seems to have anomolous effects.


# 9bd16971 04-Mar-1995 David Greenman <dg@FreeBSD.org>

Added some more of John's "anti-chatter" fixes - set the page activation
count to 0 after activating the page; the previous behavior biased the
pages too high in some cases.

Submitted by: John Dyson


# 22470903 03-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixes from John Dyson to work around vnode lock hang. Basically, remove
the VOP_BMAP calls, and add one to bdwrite.

Submitted by: John Dyson


# fde2cdc4 01-Mar-1995 David Greenman <dg@FreeBSD.org>

Various changes from John and myself that do the following:

New functions create - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjuction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# 44941008 24-Feb-1995 David Greenman <dg@FreeBSD.org>

Fixed thrashing buffer problem.

Submitted by: John Dyson


# 701d4580 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Added some code to make sure that buffers associated with directories and
metadata aren't thrashed by regular file I/O.
Added mechanism to limit the amount of outstanding I/O on a given vnode.
Pagedaemon wakeup policy changed to skew priority a little in favor of
file caching.
Slight code reorganization to improve clarity.
Added a few more comments.

Submitted by: John Dyson


# 7a539444 22-Feb-1995 David Greenman <dg@FreeBSD.org>

Only do object paging_in_progress wakeups if someone is waiting on this
condition.
Added some comments.

Submitted by: John Dyson


# 0e655886 17-Feb-1995 David Greenman <dg@FreeBSD.org>

Only clear B_VMIO in brelse() - a bunch of special processing is required
whenever this happens, and that wasn't occurring in some cases.


# b82c50c4 02-Feb-1995 David Greenman <dg@FreeBSD.org>

Make B_NOCACHE and B_INVAL buffers work correctly - throw away the data in
the page cache.

Submitted by: John Dyson


# 6a1c735d 25-Jan-1995 David Greenman <dg@FreeBSD.org>

Fix problem with freeing busy pages reported by Nick Sayer.

Submitted by: John Dyson


# 9532143a 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Fixed a variety of deadlock and panic bugs, removed the bypass code, and
implemented the ability to limit bufferspace by memory consumed. (vfs_bio.c)
Fixed recently introduced bugs that caused extra I/O to happen in some
cases. (vfs_cluster.c)

Submitted by: John Dyson


# 60efec1d 20-Jan-1995 Andrey A. Chernov <ache@FreeBSD.org>

Restore original fix from ohki, not check m for NULL it is already done
in the code above.
Submitted by: ohki@gssm.otsuka.tsukuba.ac.jp


# f76c8e8c 20-Jan-1995 Andrey A. Chernov <ache@FreeBSD.org>

Change if (m->valid == 0) to if (m && m->valid == 0)


# fc042b69 20-Jan-1995 Bill Paul <wpaul@FreeBSD.org>

Submitted by: ohki@gssm.otsuka.tsukuba.ac.jp
When using cp to copy a file under the following circumstanes:

- original file in on an NFS filesystem
- destination file is on the same NFS filesystem
- the file is less than 8Mbytes in size
- the file is larger than 65536 bytes in size

the cp process can get frozen in device-wait and never wake up (cp uses
mmap() in this case).
A small change to allocbuf() fixes this.


# 761dd667 15-Jan-1995 David Greenman <dg@FreeBSD.org>

Attempt to close a hole using splhigh/splx. There still appears to be a
serious one in the same area that I don't have time to fix.


# c3d05da5 10-Jan-1995 David Greenman <dg@FreeBSD.org>

MFS doesn't bother to associate a struct mount with the vnode...so work
around this by not trying to cluster this type of I/O.

Submitted by: John Dyson


# 5943df83 10-Jan-1995 David Greenman <dg@FreeBSD.org>

PG_FAKE is no longer used - so don't bother to clear it.


# 480dff54 10-Jan-1995 David Greenman <dg@FreeBSD.org>

Fixed some formatting weirdness that I overlooked in the previous commit.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# e03b612d 23-Oct-1994 David Greenman <dg@FreeBSD.org>

Only VM_WAIT if curproc != pageproc. A deadlock can occur otherwise.

Submitted by: John Dyson


# 08d7d166 18-Oct-1994 David Greenman <dg@FreeBSD.org>

Removed references to bclnlist which we don't use/support/need.


# 8e58bf68 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# 91b1e285 03-Oct-1994 David Greenman <dg@FreeBSD.org>

Commented out anti-paging code as it was found to be the cause of a
buffer deadlock.


# bb56ec4a 25-Sep-1994 Poul-Henning Kamp <phk@FreeBSD.org>

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# 9aba88bf 31-Aug-1994 David Greenman <dg@FreeBSD.org>

Rather than exclude bounce buffers support with NOBOUNCE, include it
with BOUNCE_BUFFERS. This is more intuitive, and is better for future
multiplatform support. Added BOUNCE_BUFFERS option to the GENERIC and
LINT kernel config files.


# e66defe8 30-Aug-1994 David Greenman <dg@FreeBSD.org>

Changed to reclaim memory from other buffers to eliminate memory
thrashing.

Submitted by: John Dyson


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 57034e74 08-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Run-time configuration of VFS update interval. Old UPDATE_INTERVAL
configuration option is no longer supported.


# 8339815f 07-Aug-1994 David Greenman <dg@FreeBSD.org>

Made pmap_kenter "TLB safe". ...and then removed all the pmap_updates that
are no longer needed because of this.


# 16f62314 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Incorporated post 1.1.5 work from John Dyson. This includes performance
improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/
pmap_kremove. These routine allow fast mapping of pages for those
architectures that have "normal" MMUs. Also included is a fix to the
pageout daemon to properly check a queue end condition.

Submitted by: John Dyson


# ce921c90 04-Aug-1994 David Greenman <dg@FreeBSD.org>

Fixed bug that would cause free memory reserves to be depleted and cause a
panic in some cases.
Submitted by: John Dyson


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 594110fe 26-May-1994 David Greenman <dg@FreeBSD.org>

Moved header definitions to buf.h, and added missing splx() - found
by Johannes Helander.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources