History log of /netbsd-current/sys/kern/vfs_bio.c
Revision Date Author Comments
# 1.303 30-Mar-2022 riastradh

Revert "kern: Sprinkle biglock-slippage assertions."

Got the diagnostic information I needed from this, and it's holding
up releng tests of everything else, so let's back this out until I
need more diagnostics or track down the original source of the
problem.


# 1.302 30-Mar-2022 riastradh

kern: Sprinkle biglock-slippage assertions.

We seem to have a poltergeist that occasionally messes with the
biglock depth, but it's very hard to reproduce and only manifests as
some other CPU spinning out on the kernel lock which is no good for
diagnostics.


Revision tags: thorpej-i2c-spi-conf2-base thorpej-futex2-base thorpej-cfargs2-base thorpej-i2c-spi-conf-base
# 1.301 25-Jul-2021 simonb

If we're only doing a count-only kern.buf sysctl, just return the number
of active members in the pool cache (plus some slop) instead of walking
all the free buffer lists. Should reduce CPU usage for "systat vm"
to << 1%, especially for machines with a larger number of buffers.


# 1.300 24-Jul-2021 simonb

Expose KERN_BUFSLOP in <sys/sysctl.h>.


# 1.299 24-Jul-2021 simonb

Pad out the slop for kern.buf based on the passed in element size,
rather than a size of an unrelated struct.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-cfargs-base thorpej-futex-base
# 1.298 01-Apr-2021 simonb

branches: 1.298.2;
Add a sysctl hashstat collector for bufhash.


# 1.297 31-Jul-2020 chs

branches: 1.297.2; 1.297.4;
fix the UFS2 extattr truncate code to play nice with wapbl.
also, rather than pull in the FreeBSD V_NORMAL/V_ALT flags to
vinvalbuf() and the buf b_xflags field and BX_ALTDATA flag,
add a binvalbuf() function to invalidate a specific buffer
and use that to invalidate the two possible extattr bufs
during IO_EXT truncations.


# 1.296 11-Jun-2020 ad

uvm_availmem(): give it a boolean argument to specify whether a recent
cached value will do, or if the very latest total must be fetched. It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.


# 1.295 27-Apr-2020 jdolecek

pass B_PHYS|B_RAW also in nestio_setup(), courtesy to e.g. xbd(4), which
wants to know whether the buf came from user space or bio subsystem


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421
# 1.294 20-Apr-2020 ad

Rename buf_syncwait() to vfs_syncwait(), and have it wait on v_numoutput
rather than BC_BUSY. Removes the dependency on bufhash.


Revision tags: bouyer-xenpvh-base1 phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

branches: 1.290.2;
- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
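
The overflow quoted above can be reproduced in isolation. The following is a minimal sketch, not the kernel code: `bufcount_max()` and `would_overflow_int()` are hypothetical names, and the only figure taken from the log is the operand from the UBSan message (3345138, i.e. a buffer count of 6690276 halved).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the wapbl_start() computation. The count is
 * unsigned and is widened to size_t before the multiplication, so the
 * product is never computed in signed int arithmetic.
 */
static size_t
bufcount_max(unsigned int nbuf)
{
	return ((size_t)nbuf / 2) * 1024;
}

/* Returns 1 if (nbuf / 2) * 1024 would not fit in a signed int. */
static int
would_overflow_int(int nbuf)
{
	assert(nbuf >= 0);
	return (nbuf / 2) > INT_MAX / 1024;
}
```

With a 32-bit `int`, `would_overflow_int(6690276)` reports the overflow, while `bufcount_max(6690276u)` yields 3425421312 without undefined behavior.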


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
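
The format-string rules listed in this commit (the 'j' length modifier, "%#jx" in place of "%p", and the double cast through uintptr_t) can be illustrated in userland with snprintf(3). This is only a sketch under that assumption; it does not use the kernhist(9) macros, and `format_record()` and `record_matches()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one pointer and one count as uintmax_t values. The pointer is
 * cast to uintptr_t first, then widened to uintmax_t, which avoids the
 * "pointer to integer of a different size" warning.
 */
static int
format_record(char *out, size_t len, const void *ptr, unsigned long count)
{
	assert(out != NULL && len > 0);
	return snprintf(out, len, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)ptr, (uintmax_t)count);
}

/* Returns 1 if the formatted record equals the expected string. */
static int
record_matches(const void *ptr, unsigned long count, const char *expect)
{
	char buf[64];

	format_record(buf, sizeof(buf), ptr, count);
	return strcmp(buf, expect) == 0;
}
```

Because every argument has the same width, a consumer that only sees the format string at run time can still step through the argument list correctly.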


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock = &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
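
The log does not show the reorganised computation. One common way to trade a 32-bit division for a wider multiplication, shown here purely as an assumed illustration and not as the code in vfs_bio.c, is to scale a 32-bit random value into a range with a 64-bit multiply and a shift. `scale_to_range()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 32-bit value r onto [0, n) using one 64-bit multiplication and
 * a shift instead of r % n. The result is floor(r * n / 2^32).
 */
static uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The caller would feed in the output of a 32-bit generator; the result is always strictly less than `n`.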


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
The previous version panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
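
The problem described here is that a 'u_long' value does not survive a trip through an 'int' temporary on LP64 systems. A minimal sketch of the idea, with hypothetical helper names and no relation to the actual sysctl handler, is to range-check the new value in the variable's own type:

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if v could be held in an int temporary without loss. */
static int
fits_in_int(unsigned long v)
{
	return v <= (unsigned long)INT_MAX;
}

/*
 * Hypothetical validator: compare the proposed water mark against its
 * limits as unsigned long, so no narrowing conversion is involved.
 */
static int
watermark_ok(unsigned long newval, unsigned long lo, unsigned long hi)
{
	assert(lo <= hi);
	return newval >= lo && newval <= hi;
}
```

A value just above `INT_MAX` fails `fits_in_int()` yet is a perfectly valid water mark as far as `watermark_ok()` is concerned, which is the case the 'int' temporary mishandled.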


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.
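
For readers unfamiliar with ilog2(), it returns the position of the highest set bit, i.e. floor(log2(x)). The loop below is a portable stand-in written for illustration; it is not the <sys/bitops.h> implementation, and `ilog2_u32()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(v)) for v > 0, computed by counting right shifts. */
static unsigned int
ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	assert(v != 0);
	while (v >>= 1)
		r++;
	return r;
}
```

For a power-of-two pool size the result is exactly the exponent, so a 1 MiB size maps to 20.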


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
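
The accounting described in this commit can be modelled outside the kernel. The sketch below uses a hypothetical, cut-down `struct master_sketch` rather than the real struct buf, and keeping only the first error is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down master buffer; not the kernel's struct buf. */
struct master_sketch {
	size_t	b_bcount;	/* total bytes requested */
	size_t	b_resid;	/* bytes not yet reported back */
	int	b_error;	/* first error seen, 0 if none (assumption) */
	int	done;		/* set where biodone(mbp) would be called */
};

/* One nested buffer reports completion of len bytes. */
static void
nested_done(struct master_sketch *mbp, size_t len, int error)
{
	assert(len <= mbp->b_resid);
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= len;
	if (mbp->b_resid == 0) {
		/* On error, report the whole transfer as not done. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}

/*
 * Drive n nested buffers of len bytes each; buffer number bad fails
 * with EIO (pass bad >= n for no failure). Returns the final b_resid.
 */
static size_t
final_resid(size_t n, size_t len, size_t bad)
{
	struct master_sketch m = { n * len, n * len, 0, 0 };
	size_t i;

	assert(n > 0 && len > 0);
	for (i = 0; i < n; i++)
		nested_done(&m, len, i == bad ? EIO : 0);
	assert(m.done);
	return m.b_resid;
}
```

With four 512-byte nested buffers and no failure the master ends with a residual of 0; if any one of them fails, the residual is reset to the full 2048 bytes, which satisfies the "error implies non-zero residual" assertion mentioned above.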


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
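
A minimal sketch of the convention described above, not the kernel code
itself. struct io_req and both helper names are hypothetical stand-ins
for struct buf and its users; only the b_flags and b_error field names
come from the log entry.

```c
#include <errno.h>

/* Hypothetical stand-in for the two struct buf fields discussed here. */
struct io_req {
	int	b_flags;	/* locking rules apply; drivers leave it alone */
	int	b_error;	/* owned by whoever completes the request */
};

/* Completion side: report failure through b_error, never through b_flags. */
void
io_complete(struct io_req *bp, int error)
{
	bp->b_error = error;
}

/* Consumer side: a non-zero b_error replaces the old B_ERROR flag test. */
int
io_failed(const struct io_req *bp)
{
	return bp->b_error != 0;
}
```

The completer writes a single field it owns, so it needs no knowledge of
how b_flags is protected.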


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.
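
A rough sketch of how a request size might be mapped to a power-of-two
pool once the smallest pool is 512 bytes. POOL_MIN_SHIFT and
pool_index() are illustrative names, not taken from vfs_bio.c, and the
real lookup may differ.

```c
#include <stddef.h>

/* Assumed: the smallest pool holds 512-byte (1 << 9) buffers. */
#define POOL_MIN_SHIFT	9

/*
 * Return the index of the smallest power-of-two pool that fits `size'.
 * Index 0 is the 512-byte pool, index 1 the 1k pool, and so on, so the
 * index no longer equals the size in kilobytes.
 */
unsigned int
pool_index(size_t size)
{
	unsigned int n = 0;

	while (size > ((size_t)1 << (n + POOL_MIN_SHIFT)))
		n++;
	return n;
}
```

With this mapping a single-frag directory on a 4k/512 filesystem lands
in the 512-byte pool instead of occupying a 1k buffer.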


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.
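
A simplified sketch of the propagation rule, under stated assumptions:
struct master_io and piece_done() are hypothetical, and EIO is an
assumed choice of error code for a short transfer. It is not the
nestiobuf implementation.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the master buffer's bookkeeping. */
struct master_io {
	int	error;		/* 0 until some piece fails */
	size_t	remaining;	/* bytes still outstanding */
};

/*
 * Called once per completed piece.  `expected' is the size of the piece,
 * `done' is how much of it was actually transferred.
 */
void
piece_done(struct master_io *m, int error, size_t expected, size_t done)
{
	/* A short transfer without an error code still counts as a failure. */
	if (error == 0 && done < expected)
		error = EIO;	/* assumed choice of error code */

	/* Record failures only; a later success never clears the error. */
	if (error != 0)
		m->error = error;

	m->remaining -= expected;
}
```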

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported that this helps interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.
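
A small sketch of the adjusted calculation with the units made explicit.
release_target() and its parameters are hypothetical; converting the
page deficit to bytes before taking the minimum is an assumption about
how the two quantities are reconciled.

```c
#include <stddef.h>

/*
 * How many bytes of buffer memory to give back.
 *   page_deficit  - pages UVM still wants (free target minus free pages)
 *   page_size     - bytes per page
 *   bufmem_bytes  - bytes currently held by the buffer cache
 */
size_t
release_target(size_t page_deficit, size_t page_size, size_t bufmem_bytes)
{
	size_t want = page_deficit * page_size;	/* pages -> bytes */
	size_t cap = bufmem_bytes / 16;		/* at most 1/16 of the cache */

	return want < cap ? want : cap;
}
```

Keeping both operands in bytes avoids the pages-versus-bytes mismatch
the entry describes.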

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
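
One simple reading of this growth control, as a sketch only. may_grow()
is a hypothetical name, the 16-step scale is an assumption, and `roll'
stands in for a random number supplied by the caller. Here the chance of
growing falls linearly as the cache fills.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the cache may allocate new memory.
 *   bufmem  - bytes currently used by the cache
 *   hiwater - high water mark in bytes
 *   roll    - caller-supplied random number, reduced modulo 16 here
 */
bool
may_grow(uint64_t bufmem, uint64_t hiwater, unsigned int roll)
{
	uint64_t step;
	unsigned int fullness;

	/* At or above the high water mark: never grow, skip the arithmetic. */
	if (hiwater == 0 || bufmem >= hiwater)
		return false;

	/* Fullness on a 0..15 scale, dividing first so nothing overflows. */
	step = (hiwater + 15) / 16;
	fullness = (unsigned int)(bufmem / step);

	/* The fuller the cache, the fewer rolls allow growth. */
	return (roll % 16) >= fullness;
}
```

An empty cache always grows, a half-full one grows on about half of the
rolls, and a nearly full one almost never does.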

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require the bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- Backward compatibility for loading block/character device
switches via the LKM framework is broken. This is necessary to convert
from block/character device major to device name at runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the device major number list is packed into the kernel and
the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
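
A minimal sketch of the mapping above, in standard C. The helper names
(copy_block, shift_right, clear_block) are hypothetical and exist only for
illustration; the point is that the source and destination arguments swap
places when moving from bcopy()/ovbcopy() to memcpy()/memmove().

```c
#include <assert.h>
#include <string.h>

/* Old: bcopy(src, dst, len).  New: memcpy(dst, src, len); no overlap allowed. */
static void
copy_block(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	memcpy(dst, src, len);
}

/* Old: ovbcopy(buf, buf + by, len).  New: memmove(); overlap is permitted. */
static void
shift_right(char *buf, size_t len, size_t by)
{
	memmove(buf + by, buf, len);
}

/* Old: bzero(p, len).  New: memset(p, 0, len). */
static void
clear_block(void *p, size_t len)
{
	memset(p, 0, len);
}
```

Note that memcmp() also orders its result, whereas bcmp() only distinguishes
equal from unequal, so callers that test for zero behave the same either way.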


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
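
The fallback described above can be sketched as follows. This is an
illustration only: struct proc is pared down to two counters, and the
helper names accounting_proc() and charge_block_read() are hypothetical,
not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down stand-ins for the kernel's types and globals. */
struct proc {
	long ru_inblock;	/* block input operations charged to the process */
	long ru_oublock;	/* block output operations charged to the process */
};

static struct proc proc0;	/* process 0 always exists */
static struct proc *curproc;	/* may be NULL, e.g. while handling a panic */

/* Pick the process to charge: curproc if set, otherwise process 0. */
static struct proc *
accounting_proc(void)
{
	struct proc *p = (curproc != NULL) ? curproc : &proc0;

	assert(p != NULL);
	return p;
}

static void
charge_block_read(void)
{
	accounting_proc()->ru_inblock++;
}
```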


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite out, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.296 11-Jun-2020 ad

uvm_availmem(): give it a boolean argument to specify whether a recent
cached value will do, or if the very latest total must be fetched. It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.


# 1.295 27-Apr-2020 jdolecek

pass B_PHYS|B_RAW also in nestio_setup(), courtesy to e.g. xbd(4), which
wants to know whether the buf came from user space or bio subsystem


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421
# 1.294 20-Apr-2020 ad

Rename buf_syncwait() to vfs_syncwait(), and have it wait on v_numoutput
rather than BC_BUSY. Removes the dependency on bufhash.


Revision tags: bouyer-xenpvh-base1 phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

branches: 1.290.2;
- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
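
A minimal sketch of the arithmetic involved. fake_buf_nbuf() is a
hypothetical stand-in for buf_nbuf() with the new unsigned return type, and
its value is chosen so that half of it matches the 3345138 in the report.
The explicit widening to size_t before the multiplication is an assumption
made here for illustration; it is not something the commit itself states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for buf_nbuf() after the change: unsigned result. */
static unsigned int
fake_buf_nbuf(void)
{
	return 6690276u;	/* half of this is 3345138, as in the report */
}

static size_t
bufcount_max(void)
{
	size_t half = (size_t)fake_buf_nbuf() / 2;

	/* Widening first means the product is never computed in int. */
	assert(half <= SIZE_MAX / 1024);
	return half * 1024;
}
```

With a signed int, 3345138 * 1024 exceeds INT_MAX and the behavior is
undefined; computed in an unsigned or wider type, the same product is
well defined.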


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
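
The format-string rules above can be sketched in userland C with
snprintf(3). format_record() is a hypothetical helper, not part of
kernhist(9); it only shows the cast chain for pointers and the 'j' length
modifier for numeric values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats one record using the conventions above. */
static int
format_record(char *out, size_t len, const void *bp, unsigned int flags)
{
	assert(out != NULL && len > 0);
	/*
	 * Pointers are cast to uintptr_t, then to uintmax_t, and printed
	 * with "%#jx".  Plain numbers are widened to uintmax_t and printed
	 * with "%ju".
	 */
	return snprintf(out, len, "bp=%#jx flags=%ju",
	    (uintmax_t)(uintptr_t)bp, (uintmax_t)flags);
}
```

Because every argument has the same width, the formatter never has to
guess how many machine words a given value occupies.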


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on the call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
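
The completion rule described above can be sketched as follows. This is a
simplified model, not the kernel code: struct master_buf keeps only the
fields the message mentions, the done flag stands in for the call to
biodone(mbp), and locking is omitted. The name nested_done() is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down master buffer; the real struct buf is larger. */
struct master_buf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back by nested bufs */
	int    b_error;		/* latest error reported by a nested buffer */
	int    done;		/* stands in for calling biodone(mbp) */
};

/* Called once per nested buffer as it finishes. */
static void
nested_done(struct master_buf *mbp, size_t donebytes, int error)
{
	assert(mbp->b_resid >= donebytes);
	mbp->b_resid -= donebytes;
	if (error != 0)
		mbp->b_error = error;
	if (mbp->b_resid == 0) {
		/* On error, report the whole transfer as residual. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

Resetting b_resid only at the final completion keeps the countdown intact
while nested buffers are still outstanding.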


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
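
The propagation rule described above can be sketched as follows. This is
an illustration under stated assumptions, not the kernel code: struct
master_buf and nested_done() are hypothetical names, and the choice that
the most recent non-zero error is the one kept is an assumption.

```c
/* Hypothetical stand-in for the master buffer's bookkeeping. */
struct master_buf {
    int error;      /* 0 while no nested buffer has failed */
    int pending;    /* nested buffers still outstanding */
};

/*
 * Called once per completed nested buffer.  A non-zero error is
 * recorded in the master; a later successful nested buffer never
 * resets it to 0.  Returns 1 when the last nested buffer finishes.
 */
static int
nested_done(struct master_buf *m, int nested_error)
{
    if (nested_error != 0)
        m->error = nested_error;
    m->pending--;
    return m->pending == 0;
}
```

Because success never writes to the error field, the completion order of
the nested buffers no longer decides whether the master reports a failure.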


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.
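
A sketch of why the index stopped matching the size in kilobytes,
assuming power-of-two pools that start at 512 bytes. The macro
MIN_POOL_SHIFT, the helpers, and the "buf512b"/"buf1k" name format are
illustrative assumptions, not taken from this log:

```c
#include <stddef.h>
#include <stdio.h>

#define MIN_POOL_SHIFT 9    /* assumed smallest pool: 1 << 9 = 512 bytes */

/* Size in bytes of the pool at a given array index. */
static size_t
pool_size(unsigned idx)
{
    return (size_t)1 << (MIN_POOL_SHIFT + idx);
}

/*
 * Name a pool after its real size instead of its array index:
 * sub-kilobyte pools get a "b" suffix, the rest a "k" suffix.
 */
static void
pool_name(unsigned idx, char *buf, size_t len)
{
    size_t size = pool_size(idx);

    if (size >= 1024)
        snprintf(buf, len, "buf%zuk", size / 1024);
    else
        snprintf(buf, len, "buf%zub", size);
}
```

Under these assumptions index 0 is the 512-byte pool and index 1 the 1k
pool, so printing the index as a kilobyte count would mislabel every pool.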


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
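
The adjustment can be sketched as below. This is a simplified model, not
the kernel's canrelease function: release_target() and its parameters are
hypothetical, and all quantities are taken to be in bytes.

```c
#include <stddef.h>

/*
 * Hypothetical model: "wanted" is what the caller asked to release.
 * If the cache has grown past the high-water mark, release at least
 * the overshoot so one call brings it back to the mark.
 */
static size_t
release_target(size_t bufmem, size_t hiwater, size_t wanted)
{
    size_t release = wanted;

    if (bufmem > hiwater && release < bufmem - hiwater)
        release = bufmem - hiwater;

    /* Never promise more than the cache currently holds. */
    if (release > bufmem)
        release = bufmem;
    return release;
}
```

In the two-filesystem scenario above, each recycled buffer that grows adds
to the overshoot, and the next call returns that overshoot instead of the
smaller requested amount.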


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to decide whether we should alloc
doesn't make sense, since 1) the condition is quite normal and 2) there is
a pool between us and uvm.
- Make the step in allocation probability smoother by moving the origin
of the curve from 0 to the lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
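
The log does not show the expression that overflowed, so the following is
only an illustration of this class of bug. Both functions and their
comparison are assumptions; "guess" stands for a random draw in 0..15.

```c
#include <stdint.h>

/*
 * Illustrative buggy form: scaling bufmem by 16 in 32-bit arithmetic
 * wraps for large caches, so the left side can become tiny and the
 * test succeeds even when bufmem is above hiwater.
 */
static int
lotsfree_naive(uint32_t bufmem, uint32_t hiwater, uint32_t guess)
{
    return (uint32_t)(bufmem * 16u) < (uint32_t)(hiwater * guess);
}

/*
 * Illustrative fixed form: reject early when above the high-water
 * mark (which also skips the random draw in a real caller), and
 * divide before multiplying so the product cannot exceed hiwater.
 */
static int
lotsfree_fixed(uint32_t bufmem, uint32_t hiwater, uint32_t guess)
{
    if (bufmem > hiwater)
        return 0;
    return hiwater / 16 * guess >= bufmem;
}
```

With bufmem at 512 MB and hiwater at 256 MB, bufmem * 16 wraps to 0 in 32
bits, so the naive test reports free space while the fixed one does not.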


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
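
The bookkeeping described in this commit can be sketched as follows,
assuming everything is converted to bytes before comparison. The function
release_bytes() and its parameters are hypothetical; only the
min(deficit, buffer memory / 16) rule and the AGE-list floor come from the
message above.

```c
#include <stddef.h>

/*
 * Hypothetical model of the release estimate:
 *  - convert the page deficit to bytes so the units match,
 *  - cap it at 1/16 of current buffer memory,
 *  - but always offer at least what sits on the AGE list.
 */
static size_t
release_bytes(size_t page_deficit, size_t page_size,
    size_t bufmem, size_t age_bytes)
{
    size_t want = page_deficit * page_size;    /* pages -> bytes */
    size_t cap = bufmem / 16;

    if (want > cap)
        want = cap;
    if (want < age_bytes)
        want = age_bytes;
    return want;
}
```

Treating the AGE list as a floor is what makes it behave like a "delayed
free" list: its memory is handed back the next time the estimate runs.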


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
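
The growth control can be pictured with a small model. This is a sketch
under assumptions, not the committed code: may_grow() and grow_chances()
are hypothetical, the random draw is passed in as 0..15 so the behaviour
is deterministic, and the curve starts at 0.

```c
#include <stddef.h>

/*
 * Hypothetical admission test: the fuller the cache, the fewer of the
 * 16 possible draws allow a new allocation.
 */
static int
may_grow(size_t bufmem, size_t hiwater, unsigned draw)
{
    if (bufmem >= hiwater)
        return 0;
    return (hiwater / 16) * draw >= bufmem;
}

/* Count how many of the 16 draws would permit growth at this size. */
static unsigned
grow_chances(size_t bufmem, size_t hiwater)
{
    unsigned allowed = 0;

    for (unsigned draw = 0; draw < 16; draw++)
        allowed += may_grow(bufmem, hiwater, draw);
    return allowed;
}
```

In this model an empty cache always grows, a half-full cache grows on half
of the draws, and a cache at the high-water mark never grows.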


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
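
A userland sketch of the technique, using <sys/queue.h>. The struct,
queue array and function names are hypothetical. It relies on the same
implementation detail the comment refers to: the head of a tail queue is
only touched when the removed element is the last one, and in that case
the head's tqh_last points at the element's own tqe_next field.

```c
#include <stddef.h>
#include <sys/queue.h>

#define NQUEUES 3

struct item {
    int id;
    TAILQ_ENTRY(item) link;
};
TAILQ_HEAD(itemq, item);

static struct itemq queues[NQUEUES];

static void
queues_init(void)
{
    for (int i = 0; i < NQUEUES; i++)
        TAILQ_INIT(&queues[i]);
}

/*
 * Remove an item that is known to be on one of the queues, without
 * being told which.  Returns the index of the queue whose tail was
 * updated, or -1 if no head lookup was needed.
 */
static int
item_remove(struct item *it)
{
    struct itemq *head = &queues[0];  /* only used if `it` is last */
    int found = -1;

    if (TAILQ_NEXT(it, link) == NULL) {
        /* Last element: find the head whose tail pointer names us. */
        for (int i = 0; i < NQUEUES; i++) {
            if (queues[i].tqh_last == &it->link.tqe_next) {
                head = &queues[i];
                found = i;
                break;
            }
        }
    }
    TAILQ_REMOVE(head, it, link);
    return found;
}
```

The search over heads runs only for tail elements, so the common case
stays O(1). The sketch assumes the item really is on one of the queues;
removing an unqueued item this way would corrupt queues[0].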


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
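
The mapping above swaps the source and destination arguments, which is easy to get wrong when converting by hand. A minimal sketch, assuming hypothetical wrapper names (compat_bcopy and friends are illustrative only and are not part of this commit):

```c
#include <string.h>

/* bcopy(src, dst, len): source first; memcpy takes the destination first. */
static void
compat_bcopy(const void *src, void *dst, size_t len)
{
	memcpy(dst, src, len);
}

/* ovbcopy(src, dst, len): like bcopy, but the regions may overlap. */
static void
compat_ovbcopy(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* bcmp(a, b, len): zero if equal, non-zero otherwise (no ordering). */
static int
compat_bcmp(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len) != 0;
}

/* bzero(p, len): clear len bytes. */
static void
compat_bzero(void *p, size_t len)
{
	memset(p, 0, len);
}
```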


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
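
The pattern described here can be sketched as below. The structures are cut-down, hypothetical stand-ins for the kernel's (only the counter fields mentioned in the message are kept), and account_block_read() is an illustrative name, not a function from vfs_bio.c.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the kernel structures. */
struct rusage_counts {
	long ru_inblock;
	long ru_oublock;
};
struct proc {
	struct rusage_counts p_ru;
};

static struct proc proc0;	/* stand-in for process 0 */
static struct proc *curproc;	/* may be NULL, e.g. while handling a panic */

/* Charge one block read to the current process, or to proc0 if none. */
static void
account_block_read(void)
{
	struct proc *p = (curproc != NULL) ? curproc : &proc0;

	p->p_ru.ru_inblock++;
}
```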


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite out, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.301 25-Jul-2021 simonb

If we're only doing a count-only kern.buf sysctl, just return the number
of active members in the pool cache (plus some slop) instead of looking
in all the free buffer list. Should reduce CPU usage for "systat vm"
to << 1% especially for machines with a larger number of buffers.


# 1.300 24-Jul-2021 simonb

Expose KERN_BUFSLOP in <sys/sysctl.h>.


# 1.299 24-Jul-2021 simonb

Pad out the slop for kern.buf based on the passed in element size,
rather than a size of an unrelated struct.


Revision tags: cjep_sun2x-base1 cjep_sun2x-base cjep_staticlib_x-base1 cjep_staticlib_x-base thorpej-i2c-spi-conf-base thorpej-cfargs-base thorpej-futex-base
# 1.298 01-Apr-2021 simonb

Add a sysctl hashstat collector for bufhash.


# 1.297 31-Jul-2020 chs

branches: 1.297.2; 1.297.4;
fix the UFS2 extattr truncate code to play nice with wapbl.
also, rather than pull in the FreeBSD V_NORMAL/V_ALT flags to
vinvalbuf() and the buf b_xflags field and BX_ALTDATA flag,
add a binvalbuf() function to invalidate a specific buffer
and use that to invalidate the two possible extattr bufs
during IO_EXT truncations.


# 1.296 11-Jun-2020 ad

uvm_availmem(): give it a boolean argument to specify whether a recent
cached value will do, or if the very latest total must be fetched. It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.


# 1.295 27-Apr-2020 jdolecek

pass B_PHYS|B_RAW also in nestio_setup(), courtesy to e.g. xbd(4), which
wants to know whether the buf came from user space or bio subsystem


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421
# 1.294 20-Apr-2020 ad

Rename buf_syncwait() to vfs_syncwait(), and have it wait on v_numoutput
rather than BC_BUSY. Removes the dependency on bufhash.


Revision tags: bouyer-xenpvh-base1 phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

branches: 1.290.2;
- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
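
The overflow in the UBSan report can be reproduced on paper: with a 32-bit int, 3345138 * 1024 exceeds INT_MAX. A minimal sketch of the check and of a widened computation, assuming hypothetical helper names (mul_overflows_int and scaled_count are illustrative; the commit itself only changed the return type of buf_nbuf()):

```c
#include <limits.h>
#include <stddef.h>

/* Non-zero if a * b would overflow int, for non-negative a and b. */
static int
mul_overflows_int(int a, int b)
{
	return b != 0 && a > INT_MAX / b;
}

/*
 * Widen to size_t before multiplying, so the product is computed in
 * the destination type rather than in int.
 */
static size_t
scaled_count(unsigned int nbufs)
{
	return (size_t)nbufs * 1024;
}
```

Note that unsigned arithmetic avoids the undefined behavior but still wraps modulo 2^32 if the multiplication is done in unsigned int; the explicit cast in the sketch is one way to avoid that on 64-bit platforms.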


Revision tags: netbsd-9-2-RELEASE netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
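
The format-string convention described above can be illustrated in userland with snprintf(3). This is a sketch only: format_record() and its field names are hypothetical and it does not use the kernhist(9) macros. It shows a pointer cast through uintptr_t to uintmax_t and printed with "%#jx", and a 64-bit count printed with "%ju".

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format one record with every argument widened to uintmax_t and the
 * 'j' length modifier in each conversion.
 */
static int
format_record(char *buf, size_t len, const void *ptr, uint64_t count)
{
	return snprintf(buf, len, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)ptr, (uintmax_t)count);
}
```

Because every argument has the same width, the consumer can walk the argument list without parsing the format string to learn each argument's size.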


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock = &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.
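
The general rule behind this fix is that the emptiness check and the removal must happen under one continuous hold of the lock. A userland sketch, assuming pthread mutexes as a stand-in for the kernel's locking and hypothetical names (struct work, enqueue, dequeue):

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

struct work {
	int id;
	TAILQ_ENTRY(work) link;
};
TAILQ_HEAD(workq, work);

static struct workq queue = TAILQ_HEAD_INITIALIZER(queue);
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Append an entry to the queue. */
static void
enqueue(struct work *w)
{
	pthread_mutex_lock(&queue_lock);
	TAILQ_INSERT_TAIL(&queue, w, link);
	pthread_mutex_unlock(&queue_lock);
}

/*
 * Take the first entry, or return NULL if the queue is empty.  The
 * lock is held across both the check and the removal, so no other
 * thread can take the entry in between.
 */
static struct work *
dequeue(void)
{
	struct work *w;

	pthread_mutex_lock(&queue_lock);
	w = TAILQ_FIRST(&queue);
	if (w != NULL)
		TAILQ_REMOVE(&queue, w, link);
	pthread_mutex_unlock(&queue_lock);
	return w;
}
```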


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
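
One common way to trade a 32-bit division for a wider multiplication when scaling a 32-bit random value is multiply-and-shift. The sketch below illustrates that general technique; whether it matches the exact computation in this revision is an assumption, and scale_to_range() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Map a 32-bit value r onto the range [0, n) without dividing:
 * compute r * n in 64 bits and keep the upper 32 bits.
 */
static uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```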


# 1.251 05-Sep-2014 matt

Don't next structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
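
The truncation described above can be illustrated outside the kernel. This is
only a sketch, not the vfs_bio.c code: stage_narrow(), stage_wide() and
set_hiwater() are hypothetical helpers, and fixed-width unsigned types stand
in for 'int' and a 64-bit 'u_long' so the narrowing is well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buggy pattern: stage a 64-bit request in a 32-bit temporary. */
static uint64_t
stage_narrow(uint64_t requested)
{
	uint32_t tmp = (uint32_t)requested;	/* high 32 bits are lost */

	return tmp;
}

/* Fixed pattern: the temporary is as wide as the variable it guards. */
static uint64_t
stage_wide(uint64_t requested)
{
	uint64_t tmp = requested;

	return tmp;
}

/* Accept a new high-water mark only if it stays above the low-water mark. */
static int
set_hiwater(uint64_t *hiwater, uint64_t lowater, uint64_t staged)
{
	assert(hiwater != NULL);
	if (staged <= lowater)
		return -1;		/* rejected, *hiwater unchanged */
	*hiwater = staged;
	return 0;
}
```

With a request just above 4 GiB, the narrow temporary leaves only the low
bits, so a perfectly valid value can fail validation or be stored wrongly.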


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.
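
A rough user-space sketch of the idea, assuming power-of-two pools starting
at 512 bytes. floor_log2() is a portable stand-in for ilog2(), and
pool_index(), pool_name() and the "buf...b/k/m" name scheme are illustrative
assumptions rather than the committed code.

```c
#include <assert.h>
#include <stdio.h>

#define MIN_SHIFT 9u	/* assumed smallest pool: 512 bytes */

/* Portable stand-in for ilog2(): floor(log2(v)) for v > 0. */
static unsigned
floor_log2(unsigned long v)
{
	unsigned n = 0;

	assert(v > 0);
	while (v >>= 1)
		n++;
	return n;
}

/* Index of the smallest power-of-two pool that can hold `size` bytes. */
static unsigned
pool_index(unsigned long size)
{
	if (size <= (1ul << MIN_SHIFT))
		return 0;
	return floor_log2(size - 1) + 1 - MIN_SHIFT;
}

/* Short name for pool `idx`: bytes below 1 KiB, then KiB, then MiB. */
static void
pool_name(unsigned idx, char *buf, size_t len)
{
	unsigned size = 1u << (idx + MIN_SHIFT);

	if (size >= 1048576u)
		(void)snprintf(buf, len, "buf%um", size / 1048576u);
	else if (size >= 1024u)
		(void)snprintf(buf, len, "buf%uk", size / 1024u);
	else
		(void)snprintf(buf, len, "buf%ub", size);
}
```

Using a logarithm keeps the size-to-index mapping independent of the number
of pools, and the unit suffix keeps the names short for large pools.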


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
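
The accounting described above can be modelled in a few lines of user-space
C. This is a sketch only: struct mock_buf and mock_nestiobuf_done() are
hypothetical stand-ins for struct buf and the nested-buffer completion path,
and the `done` flag stands in for the call to biodone(mbp).

```c
#include <assert.h>

/* Minimal stand-in for the struct buf fields the commit mentions. */
struct mock_buf {
	int  b_error;	/* first error reported by a nested buffer, 0 if none */
	long b_bcount;	/* total bytes requested */
	long b_resid;	/* bytes not yet reported back by nested buffers */
	int  done;	/* set where the real code would call biodone(mbp) */
};

/* Called once for each nested buffer covering `donebytes` of the master. */
static void
mock_nestiobuf_done(struct mock_buf *mbp, long donebytes, int error)
{
	assert(donebytes >= 0 && donebytes <= mbp->b_resid);

	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;

	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* Last nested buffer has called in. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;	/* all or nothing */
		mbp->done = 1;
	}
}
```

Resetting b_resid on error gives the master buffer the "error implies a
non-zero residual" property that the physio() assertion expects.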


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.
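
The general technique, keeping a running total per queue instead of walking
the queue each time the total is needed, might look roughly like this. The
sk_buf/sk_queue types and helper names are hypothetical and far simpler than
the kernel's buffer queues.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buf {
	struct sk_buf *next;
	unsigned long  size;	/* bytes of memory held by this buffer */
};

struct sk_queue {
	struct sk_buf *head;
	unsigned long  bytes;	/* running total for everything queued */
};

/* Insert at the head and update the total in O(1). */
static void
sk_insert(struct sk_queue *q, struct sk_buf *b)
{
	b->next = q->head;
	q->head = b;
	q->bytes += b->size;
}

/* Remove the head (if any) and update the total in O(1). */
static struct sk_buf *
sk_remove_head(struct sk_queue *q)
{
	struct sk_buf *b = q->head;

	if (b != NULL) {
		q->head = b->next;
		b->next = NULL;
		assert(q->bytes >= b->size);
		q->bytes -= b->size;
	}
	return b;
}
```

A caller that needs the total reads q->bytes directly, so a loop that
consults it once per buffer no longer costs a full traversal each time.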


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
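
The unit fix and the two adjustments can be sketched as below. This is an
illustration, not the code from this revision: can_release(), its parameters
and SKETCH_PAGE_SIZE are assumptions, and the page deficit is passed in
rather than read from UVM.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096ul	/* assumed page size for the sketch */

/*
 * Bytes the buffer cache offers to give back.
 *   bufmem       bytes currently held by the cache
 *   age_bytes    bytes sitting on the AGE list
 *   page_deficit pages the VM system still wants (<= 0 means none)
 */
static unsigned long
can_release(unsigned long bufmem, unsigned long age_bytes, long page_deficit)
{
	unsigned long want, cap;

	assert(age_bytes <= bufmem);

	/* Memory on the AGE list is always on offer. */
	if (page_deficit <= 0)
		return age_bytes;

	/* Convert pages to bytes before comparing with byte counts. */
	want = (unsigned long)page_deficit * SKETCH_PAGE_SIZE;
	cap = bufmem / 16;
	if (want > cap)
		want = cap;

	return want > age_bytes ? want : age_bytes;
}
```

Capping at one sixteenth of the cache keeps a single large demand from
draining the cache straight to the low water mark.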


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
file using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel,
and the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
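
The mapping above swaps the first two arguments for the copy routines, which is easy to get wrong by hand. The sketch below shows the converted calls with the old spelling in comments; the helper names are made up for illustration and do not come from the kernel source.

```c
#include <assert.h>
#include <string.h>

/* Copy a record.  The old call put the source first. */
static void
copy_record(char *dst, const char *src, size_t len)
{
	/* was: bcopy(src, dst, len); */
	memcpy(dst, src, len);		/* destination comes first */
}

/* Shift the first len bytes right by one; the regions overlap. */
static void
shift_right_one(char *buf, size_t len)
{
	assert(len > 0);
	/* was: ovbcopy(buf, buf + 1, len - 1); */
	memmove(buf + 1, buf, len - 1);	/* memmove handles overlap */
}

/* Zero a record. */
static void
clear_record(char *buf, size_t len)
{
	/* was: bzero(buf, len); */
	memset(buf, 0, len);
}
```

Note that memcmp() keeps the argument order of bcmp(), so only the copy and zero calls change shape.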


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
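
A minimal sketch of the fallback described above, using stand-in types: the one-field struct proc, the proc0 and curproc variables, and both helper names are simplified assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's process structure. */
struct proc {
	unsigned long ru_inblock;	/* block input operations charged */
};

static struct proc proc0;		/* stand-in for process 0 */
static struct proc *curproc;		/* may be NULL, e.g. during a panic */

/* Pick the process to charge: curproc if set, otherwise process 0. */
static struct proc *
accounting_proc(void)
{
	return (curproc != NULL) ? curproc : &proc0;
}

/* Charge one block read without ever dereferencing a NULL curproc. */
static void
count_block_read(void)
{
	struct proc *p = accounting_proc();

	assert(p != NULL);
	p->ru_inblock++;
}
```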


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.298 01-Apr-2021 simonb

Add a sysctl hashstat collector for bufhash.


Revision tags: thorpej-cfargs-base thorpej-futex-base
# 1.297 31-Jul-2020 chs

fix the UFS2 extattr truncate code to play nice with wapbl.
also, rather than pull in the FreeBSD V_NORMAL/V_ALT flags to
vinvalbuf() and the buf b_xflags field and BX_ALTDATA flag,
add a binvalbuf() function to invalidate a specific buffer
and use that to invalidate the two possible extattr bufs
during IO_EXT truncations.


# 1.296 11-Jun-2020 ad

uvm_availmem(): give it a boolean argument to specify whether a recent
cached value will do, or if the very latest total must be fetched. It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.


# 1.295 27-Apr-2020 jdolecek

pass B_PHYS|B_RAW also in nestio_setup(), courtesy to e.g. xbd(4), which
wants to know whether the buf came from user space or bio subsystem


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421
# 1.294 20-Apr-2020 ad

Rename buf_syncwait() to vfs_syncwait(), and have it wait on v_numoutput
rather than BC_BUSY. Removes the dependency on bufhash.


Revision tags: bouyer-xenpvh-base1 phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

branches: 1.290.2;
- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
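
To illustrate the overflow in the UBSan report above, the sketch below performs the same multiplication after widening the operand, so it is never evaluated in signed int. scale_bufcount() is a hypothetical helper for this example, not the wapbl code, and it assumes size_t is at least 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: multiply a buffer count by 1024 in size_t.
 * The operand is widened first, so the arithmetic is unsigned and
 * cannot trigger signed integer overflow.
 */
static size_t
scale_bufcount(unsigned int nbuf)
{
	/* Guard against wrapping on platforms with a narrow size_t. */
	assert((size_t)nbuf <= SIZE_MAX / 1024);
	return (size_t)nbuf * 1024;
}
```

For the value in the report, 3345138 * 1024 is 3425421312, which exceeds INT_MAX (2147483647) but fits in an unsigned 32-bit or wider type.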


Revision tags: netbsd-9-1-RELEASE netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
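
The format-string rules above can be tried in userland with snprintf(3). The sketch below is only an illustration of the 'j' length modifier and the double cast for pointers; format_hist() and its message text are made up for the example and are not part of kernhist(9).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a pointer and a counter using only
 * uintmax_t-sized arguments, as the commit message requires.
 */
static int
format_hist(char *out, size_t outlen, const void *ptr, unsigned long count)
{
	assert(out != NULL && outlen > 0);

	/*
	 * "%p" becomes "%#jx"; the pointer is cast to uintptr_t first and
	 * then to uintmax_t, which avoids the "pointer to integer of a
	 * different size" warning.  Plain numbers use "%ju".
	 */
	return snprintf(out, outlen, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)ptr, (uintmax_t)count);
}
```

Because every argument is passed as uintmax_t, a consumer can step through the argument list without knowing the original types.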


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
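
One common way to trade a division for a widening multiplication when scaling a 32-bit random value is sketched below. This is an assumption about the general technique, not the code from this revision; scale32() is a hypothetical name and the caller is assumed to supply the random input (in the kernel that would come from cprng_fast32()).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: map a uniformly distributed 32-bit value r into
 * the range [0, n) with one 64-bit multiply and a shift, instead of
 * a division or modulo.
 */
static uint32_t
scale32(uint32_t r, uint32_t n)
{
	assert(n != 0);		/* an empty range has no valid result */
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The product r * n is at most (2^32 - 1) * n, so its upper 32 bits are always below n.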


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set too high and exhausting the available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
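
The fix described above amounts to staging the new value in a temporary of the same type as the variable. A minimal sketch follows, assuming a hypothetical update_watermark() helper and caller-supplied bounds; it is not the sysctl handler from this file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical validator, not the kernel's code: stage a new water mark
 * in a temporary of the same type as the variable, so no bits are lost
 * before the range check.
 */
static bool
update_watermark(unsigned long *mark, unsigned long newval,
    unsigned long min, unsigned long max)
{
	unsigned long tmp = newval;	/* an 'int' here could truncate on LP64 */

	assert(mark != NULL);
	if (tmp < min || tmp > max)
		return false;		/* reject; *mark is left untouched */
	*mark = tmp;
	return true;
}
```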


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and, on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
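
The accounting described in this message can be modelled in userland as follows. This is a simplified sketch under stated assumptions: struct master_buf and nested_done() are hypothetical stand-ins for the relevant struct buf fields and for the nested-buffer completion path, and the "done" flag stands in for the call to biodone(mbp).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the struct buf fields used here. */
struct master_buf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int    b_error;		/* first error seen, 0 if none */
	int    done;		/* stands in for biodone(mbp) being called */
};

/* Called once per completed nested buffer that covered len bytes. */
static void
nested_done(struct master_buf *mbp, size_t len, int error)
{
	assert(len <= mbp->b_resid);
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;	/* keep the first error */
	mbp->b_resid -= len;
	if (mbp->b_resid == 0) {
		/* Last nested buffer: on error, report nothing as transferred. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

With this rule a completed master buffer has either no error and a zero residual, or an error and a full residual.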


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still needs work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
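
The adjusted rule can be sketched as below. This is a deliberately simplified model, not the canrelease function itself: can_release() is a hypothetical name, and the real function also weighs other inputs that this log entry does not describe.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model: when buffer memory is above the
 * high-water mark, report the whole excess as releasable so the cache
 * is brought straight back down to the mark.
 */
static size_t
can_release(size_t bufmem, size_t hiwater)
{
	assert(hiwater > 0);
	if (bufmem > hiwater)
		return bufmem - hiwater;
	return 0;	/* simplified: nothing forced at or below the mark */
}
```

Releasing the full excess in one step is what stops recycled buffers that grow each time from pushing the cache further above the mark.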


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to decide whether to allocate doesn't
make sense, since 1) the condition is quite normal and 2) there is a
pool between us and uvm.
- Make the steps in allocation probability a bit more seamless by moving the
origin of the curve from 0 to the lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.
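
A rough sketch of a growth throttle whose curve starts at the low-water
mark is shown here. It is an assumption-laden illustration, not the
committed code: sketch_may_grow and its parameters are hypothetical, and
'guess' stands in for a random draw in the range 0..15 supplied by the
caller.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch of a growth throttle.  'guess' stands in for a
 * random draw in the range 0..15 supplied by the caller.
 */
static bool
sketch_may_grow(size_t bufmem, size_t lowater, size_t hiwater,
    unsigned guess)
{
	if (bufmem < lowater)
		return true;	/* below the low-water mark: always grow */
	if (bufmem > hiwater)
		return false;	/* above the high-water mark: never grow */

	/* Divide before multiplying to keep the product small. */
	return (hiwater - lowater) / 16 * guess >= bufmem - lowater;
}
```

The closer bufmem is to the high-water mark, the fewer values of 'guess'
allow a new allocation, so growth slows gradually instead of stopping
abruptly.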


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
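
The bookkeeping described in this entry can be sketched roughly as below,
with every quantity kept in bytes. The function and parameter names are
hypothetical, and the sketch assumes the caller already knows the page
size and the number of bytes held on the AGE list; it is not the code
that was committed.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: decide how many bytes of buffer memory to give
 * back, keeping every quantity in bytes.
 */
static size_t
sketch_release_target(size_t page_deficit, size_t page_size,
    size_t bufmem, size_t age_bytes)
{
	size_t want = page_deficit * page_size;	/* pages -> bytes */
	size_t cap = bufmem / 16;

	if (want > cap)
		want = cap;		/* min(deficit, bufmem / 16) */
	if (want < age_bytes)
		want = age_bytes;	/* never less than the AGE list */
	return want;
}
```

Converting the deficit once, at the top, avoids comparing pages against
bytes or buffers later on.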


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() re-sizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- Backward compatibility for loading block/character device
switches through the LKM framework is broken. This is necessary to convert
from block/character device major to device name at runtime and vice versa.

- The restriction on assigning device majors by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switches.

- At compile time, the list of device major numbers is packed into the kernel
and the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.
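
For readers unfamiliar with the macros, a small sketch of the style this
change moves toward follows. The struct and function names are
hypothetical; only TAILQ_ENTRY, TAILQ_HEAD and TAILQ_FOREACH come from
<sys/queue.h>.

```c
#include <stddef.h>
#include <sys/queue.h>

struct sketch_buf {
	int blkno;
	TAILQ_ENTRY(sketch_buf) link;
};
TAILQ_HEAD(sketch_bufq, sketch_buf);

/* Count queue entries with the macro instead of touching tqe_next. */
static int
sketch_count(struct sketch_bufq *q)
{
	struct sketch_buf *bp;
	int n = 0;

	TAILQ_FOREACH(bp, q, link)
		n++;
	return n;
}
```

Code written this way does not depend on the field names inside the
queue entry, so the queue implementation can change without touching
its users.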


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
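
The point to watch in this mapping is that the source and destination
arguments swap places. The wrappers below are hypothetical and exist
only to show the argument order side by side; they are not part of the
change.

```c
#include <string.h>

/* Hypothetical wrappers showing the argument-order change. */
static void
sketch_bcopy(const void *src, void *dst, size_t len)
{
	memcpy(dst, src, len);		/* bcopy(src, dst, len) */
}

static void
sketch_ovbcopy(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);		/* overlap-safe variant */
}

static void
sketch_bzero(void *buf, size_t len)
{
	memset(buf, 0, len);		/* bzero(buf, len) */
}
```

bcmp keeps its argument order when it becomes memcmp, which is why it
needs no wrapper here.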


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
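
The pattern can be sketched as below. The structures and names are
hypothetical stand-ins for the kernel's own, used only to show the
fallback to process 0 when there is no current process.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's process structures. */
struct sketch_proc {
	long ru_inblock;	/* block input operations charged */
};

static struct sketch_proc sketch_proc0;		/* "process 0" */
static struct sketch_proc *sketch_curproc;	/* may be NULL */

/* Charge one block read to the current process, or to process 0. */
static void
sketch_account_read(void)
{
	struct sketch_proc *p =
	    (sketch_curproc != NULL) ? sketch_curproc : &sketch_proc0;

	p->ru_inblock++;
}
```

Taking the pointer once into a local also means the rest of the function
never reads the global again.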


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against the
DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.297 31-Jul-2020 chs

fix the UFS2 extattr truncate code to play nice with wapbl.
also, rather than pull in the FreeBSD V_NORMAL/V_ALT flags to
vinvalbuf() and the buf b_xflags field and BX_ALTDATA flag,
add a binvalbuf() function to invalidate a specific buffer
and use that to invalidate the two possible extattr bufs
during IO_EXT truncations.


# 1.296 11-Jun-2020 ad

uvm_availmem(): give it a boolean argument to specify whether a recent
cached value will do, or if the very latest total must be fetched. It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.


# 1.295 27-Apr-2020 jdolecek

pass B_PHYS|B_RAW also in nestio_setup(), as a courtesy to e.g. xbd(4), which
wants to know whether the buf came from user space or the bio subsystem


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421
# 1.294 20-Apr-2020 ad

Rename buf_syncwait() to vfs_syncwait(), and have it wait on v_numoutput
rather than BC_BUSY. Removes the dependency on bufhash.


Revision tags: bouyer-xenpvh-base1 phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

branches: 1.290.2;
- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
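
The overflow reported for 1.279 can be reproduced outside the kernel. The
following is only a userland sketch, not the committed change: the helper
name bufcount_max() is hypothetical, and widening to size_t before the
multiply is one possible way to keep the product out of signed int
arithmetic.

```c
#include <stddef.h>

/*
 * Hypothetical helper mirroring the expression quoted above.
 * nbuf is unsigned, as buf_nbuf() is after this change; the cast
 * widens the operand before the multiply, so the product is never
 * computed in (signed) int.
 */
static size_t
bufcount_max(unsigned int nbuf)
{
	return (size_t)(nbuf / 2) * 1024;
}
```

With nbuf = 6690276, nbuf / 2 is the 3345138 from the UBSan message, and the
product 3425421312 is larger than INT_MAX but fits in an unsigned 32-bit or
wider type.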


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
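
The format-string conventions described for 1.276 can be illustrated in
userland with snprintf(3). This is a sketch under assumptions: format_hist()
is a hypothetical name, and the real kernhist(9) macros are not used here.
It only shows the cast chain (pointer to uintptr_t to uintmax_t) and the 'j'
length modifier.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Userland stand-in for a history record formatter (hypothetical name).
 * Pointers go through uintptr_t and then uintmax_t and are printed with
 * "%#jx"; plain integers are cast to uintmax_t and printed with "%ju".
 */
static int
format_hist(char *out, size_t outlen, const void *bp, unsigned long blkno)
{
	return snprintf(out, outlen, "bp=%#jx blkno=%ju",
	    (uintmax_t)(uintptr_t)bp, (uintmax_t)blkno);
}
```

Because every argument has the same width, the formatter never has to guess
how many bytes each one occupies, which is the problem the first bullet
describes.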


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
The previous version panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics, since it asserts that if an error is set on a buffer, there should be a
residual amount of data left to transfer.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
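
The accounting described for 1.226 can be modelled in a few lines. This is a
simplified userland sketch, not the kernel code: struct master_model and
nested_done() are hypothetical, the field names merely follow struct buf,
and keeping the first error seen is an assumption of this sketch.

```c
/* Simplified model of a master buffer; field names follow struct buf. */
struct master_model {
	long b_bcount;	/* total bytes requested */
	long b_resid;	/* bytes not yet reported back by nested buffers */
	int  b_error;	/* first error seen, 0 if none (assumption) */
	int  done;	/* set where the kernel would call biodone(mbp) */
};

/* Hypothetical completion hook, called once per nested buffer. */
static void
nested_done(struct master_model *mbp, long donebytes, int error)
{
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* On error, report the whole transfer as outstanding. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

With two 4096-byte nested buffers under an 8192-byte master, an error on the
second one leaves b_error set and b_resid back at 8192 when the master
completes, so a consumer asserting "error implies residual" is satisfied.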


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
of the nested iobuf extent was transferred.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to decide whether to allocate doesn't
make sense, since 1) the condition is quite normal and 2) there is a
pool between us and uvm.
- Make the step of allocation possibility a bit more seamless by moving the
origin of the curve from 0 to the lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.
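
A minimal sketch of what moving the origin of the curve means, assuming a
16-step scale between the two water marks. The function name, parameters and
step count are illustrative assumptions, not the code from this revision.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only: names and the 16-step granularity are assumptions.
 * Returns how full the cache is, in sixteenths of the range between the
 * low and high water marks. 0 means an allocation should always succeed.
 */
static unsigned
fullness_step(size_t bufmem, size_t lowater, size_t hiwater)
{
	assert(hiwater > lowater && hiwater - lowater >= 16);
	if (bufmem <= lowater)
		return 0;	/* the curve now starts at lowater, not at 0 */
	return (unsigned)((bufmem - lowater) / ((hiwater - lowater) / 16));
}
```

With lowater = 800 and hiwater = 1600, a cache of 1000 bytes is at step 4,
whereas a curve starting at 0 would already put it at step 10.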


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
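
The kind of 32-bit overflow described above can be sketched as follows. This
is an assumption about the general shape of the bug (scaling the dividend by
16 before dividing), not a copy of buf_lotsfree(); both function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Overflow-prone form: 16 * bufmem wraps around in 32 bits once bufmem
 * reaches 256 MB, so a very full cache can look nearly empty.
 */
static uint32_t
step_naive(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater > 0);
	return (uint32_t)(16u * bufmem) / hiwater;
}

/* Safer form: scale the divisor down instead of scaling the dividend up. */
static uint32_t
step_safe(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater >= 16);	/* keeps the divisor non-zero */
	return bufmem / (hiwater / 16);
}
```

For small values the two forms agree. With 300 MB of buffer memory against a
256 MB high water mark, the naive form wraps and reports a step below 16,
while the safe form reports a step of at least 16.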


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
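
A rough sketch of the bookkeeping described in this entry, assuming the
release target is expressed in bytes. The function name, its parameters and
the exact combination of terms are hypothetical; the point is the explicit
pages-to-bytes conversion, the bufmem / 16 cap, and the AGE list floor.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: how many bytes of buffer memory to give back.
 *   freetarg_pages, free_pages: UVM free target and current free count
 *   page_size:                  bytes per page
 *   bufmem:                     bytes currently used by the buffer cache
 *   age_bytes:                  bytes held by buffers on the AGE list
 */
static size_t
release_target(long freetarg_pages, long free_pages, size_t page_size,
    size_t bufmem, size_t age_bytes)
{
	size_t want;

	assert(page_size > 0);
	if (free_pages >= freetarg_pages)
		return age_bytes;	/* no deficit: only the AGE list */

	/* Convert the deficit from pages to bytes before comparing. */
	want = (size_t)(freetarg_pages - free_pages) * page_size;
	if (want > bufmem / 16)
		want = bufmem / 16;	/* be conservative per call */

	/* Always offer at least what sits on the AGE list. */
	return want > age_bytes ? want : age_bytes;
}
```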


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
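
The growth control described in this entry can be sketched as below. This is
an illustration under stated assumptions, not the committed code: the random
draw is passed in as `roll` so the logic is deterministic, and the 16-step
granularity and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide whether a new buffer may be allocated.
 * `roll` stands in for a random draw in [0, 15].
 * Returns 1 to allocate, 0 to recycle an existing buffer instead.
 */
static int
may_grow(size_t bufmem, size_t hiwater, unsigned roll)
{
	size_t step;

	assert(roll < 16);
	assert(hiwater >= 16);
	if (bufmem >= hiwater)
		return 0;		/* never grow past the hard limit */

	/* Cache fullness in sixteenths of the high water mark. */
	step = bufmem / (hiwater / 16);

	/* The fuller the cache, the fewer rolls succeed. */
	return roll >= step;
}
```

An empty cache always grows, a half-full cache grows on about half of the
draws, and a cache at the high water mark never grows.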


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
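
The trick mentioned in the comment can be illustrated with a hand-rolled tail
queue. This is a sketch, not the bremfree() code: the types and function names
are made up, and the layout only mimics the general shape of a TAILQ (each
element stores the address of the pointer that points at it).

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;	/* next element, or NULL at the tail */
	struct node **prevp;	/* address of the pointer that points at us */
};

struct queue {
	struct node *first;
	struct node **lastp;	/* address of the tail's next pointer */
};

#define NQUEUES 3
static struct queue queues[NQUEUES];

static void
queue_init(struct queue *q)
{
	q->first = NULL;
	q->lastp = &q->first;
}

static void
queue_insert_tail(struct queue *q, struct node *n)
{
	n->next = NULL;
	n->prevp = q->lastp;
	*q->lastp = n;
	q->lastp = &n->next;
}

/* Remove n without being told which queue it is on. */
static void
remove_any(struct node *n)
{
	if (n->next != NULL) {
		/* Not the tail: the queue head is never touched. */
		n->next->prevp = n->prevp;
	} else {
		struct queue *q = NULL;
		int i;

		/* Tail: find the queue whose tail pointer refers to n. */
		for (i = 0; i < NQUEUES; i++) {
			if (queues[i].lastp == &n->next) {
				q = &queues[i];
				break;
			}
		}
		assert(q != NULL);	/* a kernel would panic here */
		q->lastp = n->prevp;
	}
	*n->prevp = n->next;
}
```

Only the removal of a tail element needs to find its queue head, and that
search is bounded by the small, fixed number of queues.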


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() re-sizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require the bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the device major number list is packed into the kernel and
the LKM framework will refer to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
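
The mapping above is easy to get wrong because bcopy() and ovbcopy() take the
source first while memcpy() and memmove() take the destination first. A small
sketch in standard C follows; the compat_* wrapper names are hypothetical and
exist only to spell out the argument order.

```c
#include <stddef.h>
#include <string.h>

/* bcopy(src, dst, len) -> memcpy(dst, src, len): the first two swap. */
static void
compat_bcopy(const void *src, void *dst, size_t len)
{
	memcpy(dst, src, len);
}

/* ovbcopy(src, dst, len) -> memmove(dst, src, len): overlap is allowed. */
static void
compat_ovbcopy(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* bcmp(a, b, len) -> memcmp(a, b, len): only zero vs. non-zero matters. */
static int
compat_bcmp(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len) != 0;
}

/* bzero(p, len) -> memset(p, 0, len): the fill value becomes explicit. */
static void
compat_bzero(void *p, size_t len)
{
	memset(p, 0, len);
}
```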


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
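
The fallback described above can be sketched as follows. The structure and
names are simplified stand-ins (a two-field `struct proc` and a static
`proc0`), not the kernel's definitions.

```c
#include <stddef.h>

/* Simplified stand-in for the per-process resource usage counters. */
struct proc {
	long ru_inblock;
	long ru_oublock;
};

static struct proc proc0;	/* stand-in for process 0 */

/* Pick the process to charge: the current one, or process 0 if none. */
static struct proc *
accounting_proc(struct proc *cur)
{
	return cur != NULL ? cur : &proc0;
}

/* Charge one block write without ever dereferencing a NULL curproc. */
static void
count_block_write(struct proc *cur)
{
	struct proc *p = accounting_proc(cur);

	p->ru_oublock++;
}
```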


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.296 11-Jun-2020 ad

uvm_availmem(): give it a boolean argument to specify whether a recent
cached value will do, or if the very latest total must be fetched. It can
be called thousands of times a second and fetching the totals impacts not
only the calling LWP but other CPUs doing unrelated activity in the VM
system.


# 1.295 27-Apr-2020 jdolecek

pass B_PHYS|B_RAW also in nestio_setup(), courtesy to e.g. xbd(4), which
wants to know whether the buf came from user space or bio subsystem


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421
# 1.294 20-Apr-2020 ad

Rename buf_syncwait() to vfs_syncwait(), and have it wait on v_numoutput
rather than BC_BUSY. Removes the dependency on bufhash.


Revision tags: bouyer-xenpvh-base1 phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

branches: 1.290.2;
- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
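
The overflow reported above comes from doing the multiplication in `int`. A minimal sketch of an overflow-free version is shown below; `toy_bufcount_max` is a hypothetical helper, `nbuf` stands in for the value returned by buf_nbuf(), and widening to `size_t` before multiplying is this example's choice rather than something the commit states.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch only: compute (nbuf / 2) * 1024 without signed arithmetic.
 * nbuf is an unsigned count, as buf_nbuf() returns after this change.
 */
static size_t toy_bufcount_max(unsigned int nbuf)
{
	size_t half = nbuf / 2;
	size_t result = half * 1024;

	/* Check that the unsigned multiplication did not wrap. */
	assert(result / 1024 == half);
	return result;
}
```

With the value from the UBSan report (3345138 * 1024), the product exceeds INT_MAX but is representable as an unsigned 32-bit or wider value.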


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
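
The format-string rules listed above can be illustrated in userland with snprintf(3), which accepts the same 'j' length modifier. This is a sketch under that assumption; `toy_format_record` is a hypothetical helper and does not use the kernhist(9) macros themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a pointer and a counter the way the commit describes:
 * pointers use "%#jx" and are cast to uintptr_t, then to uintmax_t;
 * other numeric values use "%ju" and are cast to uintmax_t.
 */
static int toy_format_record(char *out, size_t outlen, const void *ptr,
    unsigned int count)
{
	assert(out != NULL && outlen > 0);
	return snprintf(out, outlen, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)ptr, (uintmax_t)count);
}
```

Casting through uintptr_t first is what avoids the "pointer to integer of a different size" warning on platforms where pointers are narrower than uintmax_t.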


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
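
The commit message does not show the new computation, so the following is only an assumption about the general technique: one common way to map a 32-bit random value onto a range without dividing is to multiply in 64 bits and keep the high half. `toy_scale_u32` is a hypothetical name, and `r` stands in for a value such as the one cprng_fast32() returns.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map r, uniform over [0, 2^32), onto [0, n) using one widening
 * multiplication and a shift instead of a division or modulo.
 */
static uint32_t toy_scale_u32(uint32_t r, uint32_t n)
{
	assert(n != 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```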


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, the mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
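
The accounting described above can be modelled with a toy structure. This is a sketch, not the kernel's nestiobuf code: `struct toy_buf` and `toy_nested_done` are hypothetical, the `done` flag stands in for the call to biodone(mbp), and keeping the first reported error is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the master buffer; NOT the kernel's struct buf. */
struct toy_buf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int b_error;		/* first error from a nested buffer, 0 if none */
	int done;		/* set where the kernel would call biodone() */
};

/* Called once per nested buffer as it completes. */
static void toy_nested_done(struct toy_buf *mbp, size_t donebytes, int error)
{
	assert(donebytes <= mbp->b_resid);
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* On error, report the whole transfer as not done. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

With this shape, a consumer that asserts "error implies non-zero residual" sees a consistent master buffer even when only one nested buffer failed.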


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
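
The behaviour described above can be sketched roughly as follows. This is
an illustration only, not the kernel code: struct xbuf, nested_done() and
the "first error wins" policy are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct buf; not the kernel layout. */
struct xbuf {
	int    b_error;   /* 0 on success, an errno value otherwise */
	size_t b_resid;   /* bytes of the master request still outstanding */
};

/*
 * Called when one nested buffer covering `len` bytes completes.
 * A failing child latches its error into the master; a child that
 * succeeds later never clears an error that is already recorded.
 */
void
nested_done(struct xbuf *master, const struct xbuf *child, size_t len)
{
	assert(master->b_resid >= len);
	if (child->b_error != 0 && master->b_error == 0)
		master->b_error = child->b_error;
	master->b_resid -= len;
}
```

With this shape the order in which the nested buffers complete no longer
affects whether the master ends up with an error.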


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.
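
A minimal sketch of why the index stops matching a kilobyte count once the
smallest pool holds 512-byte buffers. The names POOL_MIN_SHIFT,
POOL_MAX_SHIFT, pool_index() and pool_size() are hypothetical, and the
64 KiB upper bound is an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MIN_SHIFT 9    /* assumed: smallest pool serves 512-byte buffers */
#define POOL_MAX_SHIFT 16   /* assumed: largest pool serves 64 KiB buffers */

/* Index of the smallest power-of-two pool whose buffers can hold `size`. */
unsigned
pool_index(size_t size)
{
	unsigned shift = POOL_MIN_SHIFT;

	assert(size > 0 && size <= ((size_t)1 << POOL_MAX_SHIFT));
	while (((size_t)1 << shift) < size)
		shift++;
	return shift - POOL_MIN_SHIFT;
}

/* Size in bytes of the buffers served by pool `idx`. */
size_t
pool_size(unsigned idx)
{
	assert(idx <= POOL_MAX_SHIFT - POOL_MIN_SHIFT);
	return (size_t)1 << (POOL_MIN_SHIFT + idx);
}
```

Under these assumptions index 0 is the 512-byte pool and index 1 is the
1k pool, so a pool name has to be derived from pool_size() rather than
from the index itself.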


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
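
One way to picture the adjusted policy is the sketch below. Only the first
branch (release the whole excess when above the high-water mark) comes from
the message above; the function name, its parameters and the other two
branches are assumptions added to make the example complete.

```c
#include <stddef.h>

/*
 * How many bytes the cache may give back.  `want` is the amount the
 * caller would like released; all names here are hypothetical.
 */
size_t
can_release(size_t bufmem, size_t hiwater, size_t lowater, size_t want)
{
	size_t room;

	/* Above the high-water mark: give back the whole excess at once. */
	if (bufmem > hiwater)
		return bufmem - hiwater;

	/* At or below the low-water mark: release nothing (assumed). */
	if (bufmem <= lowater)
		return 0;

	/* In between: honour the request without dipping below lowater. */
	room = bufmem - lowater;
	return want < room ? want : room;
}
```

With the first branch in place, a buffer that grows while being recycled
can push the cache over the mark only until the next call.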


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to decide whether we should allocate
doesn't make sense, since 1) that condition is quite normal and 2) there is
a pool between us and uvm.
- Smooth out the steps in the allocation probability by moving the origin
of the curve from 0 to the lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.
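
To illustrate the class of bug (not the actual expression used in the
kernel, which this log does not show), here is a hypothetical comparison
whose 32-bit products wrap, next to a variant with the early check and
64-bit arithmetic. Both function names are invented for the example.

```c
#include <stdint.h>

/*
 * Overflow-prone form: both products are reduced modulo 2^32, so a
 * large bufmem can wrap to a small value and compare as "plenty free".
 */
int
lotsfree_wrapping(uint32_t bufmem, uint32_t hiwater, uint32_t guess)
{
	uint32_t lhs = (uint32_t)(bufmem * 16u);
	uint32_t rhs = (uint32_t)(hiwater * guess);

	return lhs <= rhs;
}

/* Safer form: reject bufmem > hiwater first, then compare in 64 bits. */
int
lotsfree_safe(uint32_t bufmem, uint32_t hiwater, uint32_t guess)
{
	if (bufmem > hiwater)
		return 0;
	return (uint64_t)bufmem * 16 <= (uint64_t)hiwater * guess;
}
```

The early check also means the caller can skip drawing a random `guess`
at all when the cache is already over the mark.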

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.
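
A rough sketch of the calculation described in the last two paragraphs.
The page size constant, the function name and the explicit conversion of
the deficit from pages to bytes are assumptions for illustration; the
message only gives the min(...) formula and the AGE-list floor.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size, illustration only */

/*
 * Bytes the buffer cache should offer to release.
 *   page_deficit: UVM shortfall, in pages
 *   bufmem:       current buffer memory, in bytes
 *   age_bytes:    memory held by buffers on the AGE list, in bytes
 */
size_t
release_target(size_t page_deficit, size_t bufmem, size_t age_bytes)
{
	size_t want = page_deficit * SKETCH_PAGE_SIZE;   /* pages -> bytes */
	size_t cap = bufmem / 16;
	size_t n = want < cap ? want : cap;

	/* Always offer at least what is sitting on the AGE list. */
	return n > age_bytes ? n : age_bytes;
}
```

Keeping every quantity in bytes inside the function avoids the
pages-versus-bytes mix-up that the first paragraph describes.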

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
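
The growth control can be pictured with a sketch like the one below. It is
not the committed code: may_grow() and its parameters are hypothetical,
the 16-step granularity is an assumption, and `guess` stands for a uniform
random draw in 0..15 that the caller supplies.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide whether the cache may take a new allocation.  The larger
 * bufmem is relative to hiwater, the fewer values of `guess` pass
 * the test, so the chance of growing falls as the cache fills up.
 */
int
may_grow(size_t bufmem, size_t hiwater, unsigned guess)
{
	assert(guess < 16);

	if (bufmem > hiwater)
		return 0;   /* never grow past the high-water mark */

	/* Divide before multiplying so the product cannot overflow. */
	return (hiwater / 16) * guess >= bufmem;
}
```

For example, with the cache half full only draws of 8 or more allow
growth, so about half of the requests get a fresh allocation.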

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
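
A small userland sketch of the idea, using <sys/queue.h>. The structure,
the queue array and xbremfree() are hypothetical stand-ins; the point is
only that TAILQ_REMOVE needs the real head when the element is the last
one on its list, and that head can be recognised by its tail pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

#define NQUEUES 3   /* assumed number of free lists */

struct xbuf {
	TAILQ_ENTRY(xbuf) b_freelist;
};
TAILQ_HEAD(xbufq, xbuf);

struct xbufq queues[NQUEUES];

/* Remove bp from whichever free list holds it, without being told which. */
void
xbremfree(struct xbuf *bp)
{
	struct xbufq *dp = &queues[0];   /* any head will do unless bp is last */

	if (TAILQ_NEXT(bp, b_freelist) == NULL) {
		/* bp is a tail: find the head whose tail pointer refers to it. */
		for (dp = queues; dp < &queues[NQUEUES]; dp++)
			if (dp->tqh_last == &bp->b_freelist.tqe_next)
				break;
		assert(dp < &queues[NQUEUES]);
	}
	TAILQ_REMOVE(dp, bp, b_freelist);
}
```

Peeking at tqh_last and tqe_next is what "breaking the abstraction" means
here: it relies on how the macros lay out the list rather than on their
documented interface.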


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switches by the LKM framework is broken. This is necessary to convert
from block/character device major to device name at runtime and vice versa.

- The restriction on assigning device majors by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switches.

- At compile time, the list of device major numbers is packed into the kernel
and the LKM framework will refer to it to assign device major numbers
dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that is has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
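
The mapping above swaps the source and destination arguments for the copy routines, which is easy to get wrong. A minimal sketch of the four replacements follows; the wrapper names are hypothetical and exist only to label each case, while the calls themselves are the standard <string.h> functions.

```c
#include <string.h>

/* Copy n bytes between non-overlapping regions; was bcopy(src, dst, n). */
static void
copy_bytes(const void *src, void *dst, size_t n)
{
	memcpy(dst, src, n);	/* note: destination comes first */
}

/* Copy n bytes where the regions may overlap; was ovbcopy(src, dst, n). */
static void
copy_overlapping(const void *src, void *dst, size_t n)
{
	memmove(dst, src, n);
}

/* Return nonzero if the regions differ; was bcmp(a, b, n). */
static int
bytes_differ(const void *a, const void *b, size_t n)
{
	return memcmp(a, b, n) != 0;
}

/* Zero n bytes; was bzero(p, n). */
static void
zero_bytes(void *p, size_t n)
{
	memset(p, 0, n);
}
```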


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
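
The fallback described above amounts to a single conditional. The following is a minimal sketch with hypothetical stand-ins (toy_proc, toy_proc0, toy_curproc) for the kernel's process structure, process 0, and curproc; it is not the actual kernel code.

```c
#include <stddef.h>

struct toy_proc {
	long ru_inblock;	/* block input operations charged */
};

static struct toy_proc toy_proc0;	/* stands in for process 0 */
static struct toy_proc *toy_curproc;	/* may be NULL, e.g. during a panic */

/* Charge one block input to the current process, or to process 0 if none. */
static void
toy_charge_inblock(void)
{
	struct toy_proc *p = (toy_curproc != NULL) ? toy_curproc : &toy_proc0;

	p->ru_inblock++;
}
```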


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.295 27-Apr-2020 jdolecek

pass B_PHYS|B_RAW also in nestio_setup(), courtesy to e.g. xbd(4), which
wants to know whether the buf came from user space or bio subsystem


Revision tags: bouyer-xenpvh-base2 phil-wifi-20200421
# 1.294 20-Apr-2020 ad

Rename buf_syncwait() to vfs_syncwait(), and have it wait on v_numoutput
rather than BC_BUSY. Removes the dependency on bufhash.


Revision tags: bouyer-xenpvh-base1 phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

branches: 1.290.2;
- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
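
The overflow in the quoted computation comes from multiplying two signed ints before the result is widened. A minimal sketch of an overflow-free form follows, assuming the count is unsigned and is widened to size_t before the multiply; toy_bufcount_max() is a hypothetical stand-in, not the code in vfs_wapbl.c.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the wl_bufcount_max computation quoted above.
 * nbuf is unsigned and is widened to size_t before the multiply, so no
 * signed arithmetic is involved.
 */
static size_t
toy_bufcount_max(unsigned int nbuf)
{
	return (size_t)(nbuf / 2) * 1024;
}
```

With the operand from the UBSan report (3345138 after halving), the product is 3425421312, which does not fit in a 32-bit signed int but is representable once the arithmetic is unsigned.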


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
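
The format-string rules above can be illustrated with ordinary snprintf(3), which accepts the same 'j' length modifier. This is a minimal sketch under that assumption; the helper names are hypothetical and it does not use the kernhist(9) macros themselves.

```c
#include <stdint.h>
#include <stdio.h>

/* Format a pointer as described: cast to uintptr_t, then uintmax_t, "%#jx". */
static int
toy_fmt_ptr(char *out, size_t len, const void *p)
{
	return snprintf(out, len, "ptr=%#jx", (uintmax_t)(uintptr_t)p);
}

/* Format a numeric argument with the explicit 'j' length modifier. */
static int
toy_fmt_count(char *out, size_t len, unsigned int n)
{
	return snprintf(out, len, "count=%ju", (uintmax_t)n);
}
```

Because every argument is widened to uintmax_t at the call site, the consumer of the format string never has to guess how large each argument is.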


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock = &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
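
The log message does not show the new expression. One general way to trade a division for a multiplication is to cross-multiply a ratio comparison, sketched below; the function and parameter names are hypothetical and the constant 16 is only an example scale, so this should not be read as the code in this revision.

```c
/*
 * Decide whether guess/16 >= used/range without dividing, by comparing
 * the cross products instead. Unsigned arithmetic wraps rather than
 * overflowing, so callers are assumed to keep the operands small enough.
 */
static int
toy_ratio_at_least(unsigned long range, unsigned long used, unsigned int guess)
{
	return range * guess >= used * 16;
}
```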


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
The previous version panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and on its call to biodone(mbp), when mbp->b_resid is zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
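
The accounting described in this log entry can be modelled in a few lines. This is a minimal sketch under simplifying assumptions: struct toy_mbuf and toy_nestiobuf_done() are hypothetical, and a flag stands in for the call to biodone(mbp).

```c
struct toy_mbuf {
	long b_bcount;	/* total bytes requested on the master buffer */
	long b_resid;	/* bytes still outstanding */
	int  b_error;	/* first error reported by a nested buffer, or 0 */
	int  done;	/* set once completion has been signalled */
};

/* Account for one nested buffer finishing 'donebytes' bytes with 'error'. */
static void
toy_nestiobuf_done(struct toy_mbuf *mbp, long donebytes, int error)
{
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* On error, report the whole transfer as not done. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;	/* stands in for biodone(mbp) */
	}
}
```

Resetting the residual to the full count keeps the invariant that a buffer carrying an error also reports data left untransferred.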


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin
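
A minimal sketch of the kind of failure this message points at. The log does not show the changed code, so this assumes that the timeout was computed as hz divided by 25 and that a timeout of zero ticks means "no timeout". The helper name is hypothetical.

```c
/*
 * Hypothetical helper: convert "1/divisor of a second" into clock ticks.
 * With integer division, hz / 25 is 0 whenever hz < 25, and a zero tick
 * count is assumed here to mean "sleep without a timeout". Clamping the
 * result to at least one tick avoids that.
 */
static int
sketch_pause_ticks(int hz, int divisor)
{
	int ticks = hz / divisor;

	return ticks > 0 ? ticks : 1;
}
```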


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug, but with a machine having an ffs+wapbl mount, NFS
mounts and an ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.
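
A small sketch of why the index stopped matching the kilobyte count. It assumes pool sizes double from a 512-byte minimum; the name format ("buf512b", "buf1k", ...) and the sketch_ identifiers are illustrative assumptions, not taken from this log.

```c
#include <stdio.h>
#include <stddef.h>

#define SKETCH_MIN_SHIFT 9	/* assumed smallest pool: 1 << 9 = 512 bytes */

/* Write an illustrative pool name for pool number 'idx' into 'name'. */
static void
sketch_pool_name(unsigned idx, char *name, size_t len)
{
	unsigned long size = 1UL << (idx + SKETCH_MIN_SHIFT);

	if (size >= 1024)
		snprintf(name, len, "buf%luk", size / 1024);
	else
		snprintf(name, len, "buf%lub", size);
}
```

Deriving the name from the computed size rather than from the index keeps the label correct no matter where the smallest pool starts.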


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
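
The scenario above can be worked through with a small sketch. Both functions are hypothetical; normal_target stands in for whatever the usual release policy would return, and taking the larger of the two values is an assumption made for the illustration.

```c
#include <stddef.h>

/* Cache size after a buffer of 'oldsize' is recycled into 'newsize'. */
static size_t
sketch_recycle(size_t bufmem, size_t oldsize, size_t newsize)
{
	return bufmem - oldsize + newsize;
}

/*
 * Bytes that may be released right now. Above the high-water mark the
 * result is at least the excess, so one pass can bring the cache back
 * down to the mark.
 */
static size_t
sketch_canrelease(size_t bufmem, size_t hiwater, size_t normal_target)
{
	if (bufmem > hiwater) {
		size_t excess = bufmem - hiwater;

		return excess > normal_target ? excess : normal_target;
	}
	return normal_target;
}
```

Recycling a 1k buffer into a 16k one while sitting exactly at the mark leaves the cache 15k over it, and the second function then reports at least those 15k as releasable.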


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon told that this helps for interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.
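
A sketch of the general technique, assuming the tracked quantity is the total number of bytes on a queue (the message does not say which quantity is kept). All names are hypothetical. The running total is updated on every insert and removal, so reading it is O(1) instead of a walk over the queue.

```c
#include <stddef.h>

struct sketch_qbuf {
	struct sketch_qbuf *next;
	size_t size;			/* bytes held by this buffer */
};

struct sketch_queue {
	struct sketch_qbuf *head;
	size_t bytes;			/* running total, kept up to date */
};

static void
sketch_push(struct sketch_queue *q, struct sketch_qbuf *b)
{
	b->next = q->head;
	q->head = b;
	q->bytes += b->size;
}

static struct sketch_qbuf *
sketch_pop(struct sketch_queue *q)
{
	struct sketch_qbuf *b = q->head;

	if (b != NULL) {
		q->head = b->next;
		q->bytes -= b->size;
		b->next = NULL;
	}
	return b;
}

/* The O(n) recount that the running total makes unnecessary. */
static size_t
sketch_count_walk(const struct sketch_queue *q)
{
	size_t total = 0;

	for (const struct sketch_qbuf *b = q->head; b != NULL; b = b->next)
		total += b->size;
	return total;
}
```

If a loop that runs once per buffer recounts the queue each time, the cost is quadratic in the queue length; with the counter it is linear.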


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
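
A hedged sketch of a decision function with the two properties listed above. The log does not show the arithmetic that overflowed; this version assumes a comparison of a scaled water-mark span against the current usage, and avoids overflow by dividing before multiplying on 64-bit operands. The random value is passed in as a parameter so the sketch stays deterministic; the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Return 1 if a new allocation should be allowed, 0 if an existing
 * buffer should be recycled instead. 'guess' is a caller-supplied
 * random value; only its low four bits (0..15) are used.
 */
static int
sketch_lotsfree(uint64_t bufmem, uint64_t lowater, uint64_t hiwater,
    unsigned guess)
{
	if (bufmem < lowater)
		return 1;	/* below the floor: always grow */
	if (bufmem > hiwater)
		return 0;	/* early exit, no random number needed */

	/*
	 * Divide first so the product can never exceed the span between
	 * the water marks. The closer bufmem is to hiwater, the fewer
	 * values of 'guess' satisfy the comparison.
	 */
	return (hiwater - lowater) / 16 * (guess % 16) >= bufmem - lowater;
}
```

A caller would draw the random value only after the two cheap checks have failed, which is the point of the early test.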


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
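
The unit problem and the two adjustments can be sketched as below. This is an illustration under assumptions, not the committed code: the names are hypothetical, the page deficit is converted to bytes before any comparison, and the AGE list total is applied as a floor.

```c
#include <stddef.h>

static size_t sketch_min(size_t a, size_t b) { return a < b ? a : b; }
static size_t sketch_max(size_t a, size_t b) { return a > b ? a : b; }

/*
 * How many bytes of buffer memory to give back. Every quantity is in
 * bytes by the time it is compared. 'deficit_pages' is how far free
 * memory is below the VM system's target, 'age_bytes' is the memory
 * sitting on the AGE list.
 */
static size_t
sketch_release_target(size_t deficit_pages, size_t page_size,
    size_t bufmem, size_t age_bytes)
{
	size_t deficit_bytes = deficit_pages * page_size;
	size_t wanted = sketch_min(deficit_bytes, bufmem / 16);

	return sketch_max(age_bytes, wanted);
}
```

Capping at one sixteenth of the cache limits how hard a single pass can shrink it, while the floor keeps dead buffers on the AGE list from lingering.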


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel,
and the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.
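
The difference between touching queue members directly and using the
<sys/queue.h> macros can be sketched as follows. This is a minimal
illustration only: struct demo_buf, struct demo_bufq and count_bufs() are
hypothetical names, not taken from vfs_bio.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical buffer type with a tail-queue linkage field. */
struct demo_buf {
	int id;
	TAILQ_ENTRY(demo_buf) link;
};

/* Hypothetical queue head type. */
TAILQ_HEAD(demo_bufq, demo_buf);

/*
 * Count the entries on a queue.
 *
 * Direct member access would look like:
 *	for (bp = q->tqh_first; bp != NULL; bp = bp->link.tqe_next)
 * The macro form below hides the layout of the queue structures.
 */
static int
count_bufs(struct demo_bufq *q)
{
	struct demo_buf *bp;
	int n = 0;

	assert(q != NULL);
	TAILQ_FOREACH(bp, q, link)
		n++;
	return n;
}
```

A caller would set up the head with TAILQ_INIT() and add entries with
TAILQ_INSERT_TAIL() before calling count_bufs().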


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
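
Note that the copy routines swap their first two arguments in this
mapping, which is easy to get wrong when converting by hand. A minimal
sketch, assuming nothing beyond <string.h>; copy_block(), move_block() and
clear_block() are hypothetical helper names, not functions from vfs_bio.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of vfs_bio.c. */

/* Old spelling: bcopy(src, dst, len), with the source first. */
static void
copy_block(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);		/* destination comes first */
}

/* Old spelling: ovbcopy(src, dst, len); the regions may overlap. */
static void
move_block(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	memmove(dst, src, len);
}

/* Old spelling: bzero(buf, len). */
static void
clear_block(void *buf, size_t len)
{
	assert(buf != NULL);
	memset(buf, 0, len);		/* the fill byte is the middle argument */
}
```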


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
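
The fallback described above can be sketched like this. It is a simplified
illustration: struct proc_sketch, accounting_proc() and charge_read() are
hypothetical, and the real kernel keeps these counters in a nested rusage
structure rather than directly in the proc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, flattened stand-in for the per-process I/O counters. */
struct proc_sketch {
	long ru_inblock;
	long ru_oublock;
};

/* Pick the process to charge: the current one, or process 0 if none. */
static struct proc_sketch *
accounting_proc(struct proc_sketch *cur, struct proc_sketch *proc0)
{
	assert(proc0 != NULL);
	return cur != NULL ? cur : proc0;
}

/* Charge one block read through the local pointer, never through cur. */
static void
charge_read(struct proc_sketch *cur, struct proc_sketch *proc0)
{
	struct proc_sketch *p = accounting_proc(cur, proc0);

	p->ru_inblock++;
}
```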


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.294 20-Apr-2020 ad

Rename buf_syncwait() to vfs_syncwait(), and have it wait on v_numoutput
rather than BC_BUSY. Removes the dependency on bufhash.


Revision tags: bouyer-xenpvh-base1 phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

branches: 1.290.2;
- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
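
The arithmetic behind the UBSan report can be sketched as follows,
assuming a 32-bit int and unsigned int. All function names here are
hypothetical stand-ins for the quoted expression, and the size_t variant
is shown only for comparison; it is not what this commit does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for the expression quoted in the log entry. */

/* Would (nbuf / 2) * 1024 exceed INT_MAX if evaluated in int? */
static int
product_overflows_int(unsigned int nbuf)
{
	return nbuf / 2 > (unsigned int)INT_MAX / 1024;
}

/*
 * Unsigned operands: the product wraps modulo UINT_MAX + 1 instead of
 * being undefined, and is widened to size_t only afterwards.
 */
static size_t
bufcount_max_unsigned(unsigned int nbuf)
{
	return (nbuf / 2) * 1024;
}

/* Alternative: widen first, so a 64-bit size_t holds the full product. */
static size_t
bufcount_max_wide(unsigned int nbuf)
{
	return ((size_t)nbuf / 2) * 1024;
}

/* Self-check with the operand from the UBSan report (3345138 * 1024). */
static void
check_report_value(void)
{
	unsigned int nbuf = 3345138u * 2;

	assert(product_overflows_int(nbuf));
	assert(bufcount_max_unsigned(nbuf) == bufcount_max_wide(nbuf));
}
```

With 32-bit unsigned arithmetic the product can still wrap for buffer
counts roughly above eight million, which may be one reading of the
"Need more work?" remark.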


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
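
The format-string rules listed above can be illustrated with ordinary
snprintf(3). This is only a sketch of the casts and length modifiers:
format_hist_args() is a hypothetical helper, and the kernhist(9) macros
themselves are not used here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a pointer and a count the way the updated
 * format strings require.
 */
static int
format_hist_args(char *out, size_t outlen, const void *ptr,
    unsigned int count)
{
	assert(out != NULL && outlen > 0);
	/*
	 * Pointer: cast to uintptr_t first, then widen to uintmax_t and
	 * print with "%#jx" instead of "%p".
	 * Count: widen to uintmax_t and print with "%ju".
	 */
	return snprintf(out, outlen, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)ptr, (uintmax_t)count);
}
```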


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
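
One common way to trade a division for a multiplication when reducing a
32-bit random value to a range is sketched below. It is an assumption for
illustration, not necessarily the expression used in vfs_bio.c;
scale_to_range() is a hypothetical name and the 64-bit product stands in
for the "(long) multiplication" mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a uniformly distributed 32-bit value r onto [0, n) without
 * dividing: floor(r * n / 2^32), computed with a 64-bit product.
 * Hypothetical helper; the expression in vfs_bio.c may differ.
 */
static uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```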


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)
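
A minimal sketch of the caller-side contract described above, using
hypothetical stand-ins (buf_sketch, bread_sketch, brelse_sketch) rather
than the real kernel types. It assumes the read routine releases the
buffer itself on failure and hands back NULL, so the caller only calls
brelse on success:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct buf; not the kernel structure. */
struct buf_sketch {
	int busy;
};

/* Number of buffers currently held, to show that nothing leaks. */
int live_bufs_sketch;

/* Release a buffer obtained from bread_sketch(). */
void
brelse_sketch(struct buf_sketch *bp)
{
	free(bp);
	live_bufs_sketch--;
}

/*
 * Mock read: on error the buffer is released here and *bpp is set to
 * NULL (an assumption of this sketch), so no buffer escapes on failure.
 */
int
bread_sketch(int fail, struct buf_sketch **bpp)
{
	struct buf_sketch *bp = malloc(sizeof(*bp));

	*bpp = NULL;
	if (bp == NULL)
		return ENOMEM;
	bp->busy = 1;
	live_bufs_sketch++;
	if (fail) {
		brelse_sketch(bp);
		return EIO;
	}
	*bpp = bp;
	return 0;
}

/* Caller pattern: no brelse on the error path. */
int
read_block_sketch(int fail)
{
	struct buf_sketch *bp;
	int error;

	error = bread_sketch(fail, &bp);
	if (error != 0)
		return error;	/* bp is NULL here, nothing to release */
	brelse_sketch(bp);
	return 0;
}
```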


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
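
A rough sketch of the truncation described above. The names are
hypothetical, uint64_t stands in for a 64-bit u_long, and uint32_t
stands in for the 32-bit temporary so that the narrowing is well
defined in C:

```c
#include <stdint.h>

/* Models a 64-bit 'u_long' water mark (assumption of this sketch). */
typedef uint64_t wmark_t;

/*
 * Buggy pattern: stage the new value in a 32-bit temporary before
 * storing it. The upper 32 bits are silently dropped.
 */
wmark_t
stage_narrow(wmark_t v)
{
	uint32_t tmp = (uint32_t)v;

	return (wmark_t)tmp;
}

/* Fixed pattern: the temporary has the same type as the variable. */
wmark_t
stage_wide(wmark_t v)
{
	wmark_t tmp = v;

	return tmp;
}
```

Small values survive either way, which is presumably why the problem
only showed up once the sysctl values were marked as 64bit.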


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and, on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
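
The fix can be sketched as follows. master_sketch and nested_done are
hypothetical stand-ins, not the kernel's struct buf or
nestiobuf_iodone(); only the b_resid/b_bcount/b_error bookkeeping
follows the description above:

```c
#include <stddef.h>

/* Hypothetical, pared-down stand-in for the master buffer. */
struct master_sketch {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes of nested I/O still outstanding */
	int b_error;		/* first error reported, 0 if none */
	int done;		/* set where biodone(mbp) would be called */
};

/* Called once for each nested buffer that reports back. */
void
nested_done(struct master_sketch *mbp, size_t extent, int error)
{
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= extent;
	if (mbp->b_resid == 0) {
		/*
		 * All nested buffers are in. On error, report the whole
		 * transfer as not done so that "error set" implies
		 * "residual left" for whoever checks the master buffer.
		 */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```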


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
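
A rough sketch of the adjustment, with a hypothetical helper name and
parameters. It assumes the release amount is counted in bytes and that
'wanted' is whatever the function would have returned before this
change:

```c
#include <stddef.h>

/*
 * Hypothetical helper: how many bytes the cache should give back.
 * If the cache has grown past the high-water mark, always release at
 * least the excess, so one pass brings it back down to the mark.
 */
size_t
canrelease_sketch(size_t bufmem, size_t hiwater, size_t wanted)
{
	size_t excess = bufmem > hiwater ? bufmem - hiwater : 0;

	return wanted > excess ? wanted : excess;
}
```

With bufmem at 120 units and hiwater at 100, the result is at least 20
even when nothing else asks for memory back.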


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon told that this helps for interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
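
The calculation described above might look roughly like this. The
function name, parameters and exact ordering are assumptions for
illustration; the log only gives the min(page deficit, buffer
memory / 16) rule and the AGE-list floor:

```c
#include <stddef.h>

/*
 * Hypothetical sketch of the release target, in bytes.
 * deficit_pages: how far UVM is below its free target, in pages
 * page_size:     bytes per page
 * bufmem:        bytes currently used by the buffer cache
 * age_bytes:     bytes held by buffers on the AGE list
 */
size_t
release_target_sketch(size_t deficit_pages, size_t page_size,
    size_t bufmem, size_t age_bytes)
{
	size_t want = deficit_pages * page_size;	/* pages -> bytes */
	size_t cap = bufmem / 16;
	size_t n = want < cap ? want : cap;

	/* Always offer at least what sits on the AGE list. */
	return n > age_bytes ? n : age_bytes;
}
```

The pages-to-bytes conversion on the first line is the unit mismatch
the commit message is about.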


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() re-sizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name at runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel and
the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
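
A small sketch of the mapping, mainly to highlight that the source and
destination arguments swap places for bcopy/ovbcopy. The function name and
buffer contents are made up for illustration.

```c
#include <string.h>

/* Hypothetical demo; returns 0 if every step behaves as expected. */
static int
demo_conversions(void)
{
	char src[8] = "netbsd";
	char dst[8];

	memcpy(dst, src, sizeof(dst));		/* was bcopy(src, dst, sizeof(dst)) */
	memmove(dst + 1, dst, 3);		/* was ovbcopy(dst, dst + 1, 3) */
	if (memcmp(dst, "nnetsd", 7) != 0)	/* was bcmp(dst, "nnetsd", 7) */
		return -1;
	memset(dst, 0, sizeof(dst));		/* was bzero(dst, sizeof(dst)) */
	return dst[0];
}
```

The memmove() call is the one that needs overlap-safe semantics, which is
why ovbcopy maps to memmove rather than memcpy.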


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


Revision tags: phil-wifi-20200411
# 1.293 11-Apr-2020 jdolecek

for bmempools set align, not ioff


# 1.292 11-Apr-2020 jdolecek

explicitly use DEV_BSIZE align for all bmempools

this is required for Xen xbd(4) in order to not have to use bounce buffers

the alignment is implicitly provided when POOL_REDZONE is not active,
this change makes it also aligned when POOL_REDZONE _is_ active - that is
when (!KMSAN && (DIAGNOSTIC || KASAN))


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
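
For illustration only: the commit itself changed the return type of
buf_nbuf(), but an unsigned int product can still wrap for large enough
buffer counts. One further option is to widen before multiplying. The helper
name bufcount_max() and the choice of uint64_t are assumptions, not the
wapbl code.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: widen to 64 bits before scaling by 1024, so
 * the product cannot overflow for any unsigned int buffer count.
 */
static uint64_t
bufcount_max(unsigned int nbuf)
{
	return (uint64_t)(nbuf / 2) * 1024;
}
```

With the operands from the UBSan report, 3345138 * 1024 is 3425421312, which
does not fit in a 32-bit signed int but is computed exactly here.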


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even though there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
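
The format-string conventions described above can be tried out in userland
with plain snprintf(3). The sketch below is not kernhist(9) code; both helper
names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Pointers are cast to uintptr_t first and then to uintmax_t, which
 * avoids the "pointer to integer of a different size" warning.
 */
static uintmax_t
hist_arg_from_ptr(const void *p)
{
	return (uintmax_t)(uintptr_t)p;
}

/* Format one history argument with the 'j' length modifier. */
static int
format_hist_arg(char *out, size_t len, uintmax_t arg)
{
	return snprintf(out, len, "%#jx", arg);
}
```

Because every argument has the same type, a non-literal format string can be
processed without the compiler needing to know each argument's size.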


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
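
A common way to trade a division for a multiplication when bounding a
32-bit random value is sketched below. This is an assumption about the
general technique, not the code committed here; scale_to_range() is a
hypothetical name and r stands in for a value such as the result of
cprng_fast32().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a uniformly distributed 32-bit value r onto [0, n) with one
 * widening multiplication and a shift, avoiding division and modulo.
 */
uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The product is at most (2^32 - 1) * n, so after the shift the result is
always strictly less than n.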


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts the available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
The previous version panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
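
The failure mode described above can be sketched in userland as follows.
This is an illustration only, with hypothetical function names and
fixed-width types standing in for 'u_long' and 'int' on an LP64 machine;
it is not the sysctl handler itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy pattern: stage a 64-bit setting in a 32-bit temporary.
 * The narrowing conversion is implementation-defined; common
 * compilers keep only the low 32 bits.
 */
uint64_t
stage_in_narrow_temp(uint64_t requested)
{
	int32_t tmp = (int32_t)requested;	/* high bits are lost here */

	return (uint64_t)tmp;
}

/* Fixed pattern: the temporary has the same width as the variable. */
uint64_t
stage_in_wide_temp(uint64_t requested)
{
	uint64_t tmp = requested;

	return tmp;
}
```

Small values survive either way, which is presumably why the problem
went unnoticed on 32-bit platforms.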


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
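
The accounting described in this commit can be sketched with a mock
structure. Everything below is a simplified, hypothetical stand-in for
struct buf and nestiobuf_iodone(); only the field names b_bcount, b_resid
and b_error come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the struct buf fields used here. */
struct mock_buf {
	size_t	b_bcount;	/* bytes requested */
	size_t	b_resid;	/* bytes not yet reported back */
	int	b_error;	/* 0 or an errno value */
	int	done;		/* set where biodone(mbp) would be called */
};

/*
 * Account for one finished nested buffer covering donebytes of the
 * master request; error is that nested buffer's b_error.
 */
void
mock_nestiobuf_done(struct mock_buf *mbp, size_t donebytes, int error)
{
	assert(donebytes <= mbp->b_resid);
	if (error != 0)
		mbp->b_error = error;	/* one failure fails the master */
	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* All pieces are in; on error report nothing transferred. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

With this shape, a consumer that asserts "error implies nonzero residual"
sees a consistent master buffer in both the success and failure cases.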


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.
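
One way to derive a pool name from the array index once the smallest pool
is 512 bytes is sketched below. This is an illustrative guess at the idea,
not the committed code: pool_name(), SMALLEST_POOL_SHIFT and the exact
name format are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SMALLEST_POOL_SHIFT	9	/* assumed: 1 << 9 = 512 bytes */

/*
 * Build a short name for the pool at the given array index.
 * Index 0 is the 512-byte pool, so the index no longer equals
 * the size in kilobytes and sub-1k sizes are named in bytes.
 */
void
pool_name(char *name, size_t len, unsigned int index)
{
	unsigned int size = 1u << (index + SMALLEST_POOL_SHIFT);

	assert(name != NULL && len >= 8);
	if (size >= 1024)
		snprintf(name, len, "buf%uk", size / 1024);
	else
		snprintf(name, len, "buf%ub", size);
}
```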


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
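
The adjusted policy can be sketched roughly as follows. This is a
simplified model under stated assumptions, not the kernel's canrelease
function: bytes_to_release() and its parameters are hypothetical, and
the value returned between the two marks is a caller-supplied
placeholder for whatever the real heuristic computes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide how many bytes the cache should give back.
 *   bufmem      - bytes currently used by the cache
 *   lowater     - never shrink below this
 *   hiwater     - never stay above this
 *   reclaimable - placeholder for the normal between-marks heuristic
 */
size_t
bytes_to_release(size_t bufmem, size_t lowater, size_t hiwater,
    size_t reclaimable)
{
	assert(lowater <= hiwater);
	if (bufmem < lowater)
		return 0;			/* nothing to give back */
	if (bufmem > hiwater)
		return bufmem - hiwater;	/* shed the whole excess now */
	return reclaimable;
}
```

The second branch is the point of the change: once growth on recycle has
pushed usage past the high-water mark, the excess is released in one step
rather than left in place.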


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
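
The log does not show the overflowing expression, so the following is only
an illustration of how a multiply-before-divide threshold can wrap in 32
bits and make an over-limit cache look half empty. `thresh_naive()` and
`thresh_safe()` are hypothetical names; the 0..16 scale is an assumption
made for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale bufmem to a 0..16 threshold by multiplying first.
 * The product is truncated to 32 bits, so it wraps once bufmem
 * reaches 256 MB.
 */
static uint32_t
thresh_naive(uint32_t bufmem, uint32_t hiwater)
{
	uint32_t scaled = (uint32_t)(16u * bufmem);	/* may wrap */

	assert(hiwater != 0);
	return scaled / hiwater;
}

/* Same scale, but divide the limit first so nothing can wrap. */
static uint32_t
thresh_safe(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater >= 16);
	return bufmem / (hiwater / 16);
}
```

With a 400 MB high-water mark and 450 MB in use, the naive form yields 7
(apparently under half full) while the safe form yields 18 (over the limit).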


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
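
The release calculation described in this entry can be sketched as follows.
This is a userland approximation, not the committed code: `release_target()`
and its parameters are hypothetical, and the explicit pages-to-bytes
conversion is an assumption about how the units are reconciled.

```c
#include <assert.h>
#include <stddef.h>

static size_t
min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

static size_t
max_sz(size_t a, size_t b)
{
	return a > b ? a : b;
}

/*
 * Bytes of buffer memory to give back.
 *   page_deficit: pages UVM still wants (0 if it is satisfied)
 *   page_size:    bytes per page
 *   bufmem:       bytes currently used by the buffer cache
 *   age_bytes:    bytes sitting on the AGE list
 */
static size_t
release_target(size_t page_deficit, size_t page_size,
    size_t bufmem, size_t age_bytes)
{
	size_t deficit_bytes;

	assert(page_size != 0);
	deficit_bytes = page_deficit * page_size;	/* pages -> bytes */

	/* Never more than 1/16 of the cache, never less than the AGE list. */
	return max_sz(age_bytes, min_sz(deficit_bytes, bufmem / 16));
}
```

Converting the deficit to bytes before comparing keeps both operands of the
min() in the same unit, which is the class of mistake this revision corrects.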


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
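
The growth control described above might look roughly like this. It is a
sketch, not the committed code: `may_grow()` and `grow_chances()` are
hypothetical names, the random draw is passed in so the example is
deterministic, and the chance of growth falls off linearly with cache size
rather than as a strict 1/size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide whether the cache may allocate new memory.
 * draw is a pseudo-random value in 0..15 supplied by the caller.
 */
static int
may_grow(size_t bufmem, size_t hiwater, unsigned int draw)
{
	size_t thresh;

	assert(hiwater >= 16 && draw < 16);
	if (bufmem >= hiwater)
		return 0;			/* at or over the limit */
	thresh = bufmem / (hiwater / 16);	/* 0..15 */
	return draw > thresh;
}

/* How many of the 16 possible draws permit growth at this size. */
static unsigned int
grow_chances(size_t bufmem, size_t hiwater)
{
	unsigned int draw, n = 0;

	for (draw = 0; draw < 16; draw++)
		n += may_grow(bufmem, hiwater, draw);
	return n;
}
```

An empty cache grows on 15 of 16 draws, a half-full one on 7 of 16, and one
at the limit never grows, so allocation pressure eases gradually instead of
stopping abruptly at the high-water mark.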


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
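
The trick mentioned in this comment can be illustrated in userland with
<sys/queue.h>. This is a sketch rather than the kernel's bremfree():
`struct bqueue` and `remove_from_any()` are hypothetical, and it relies on
the common TAILQ layout (fields `tqh_last` and `tqe_next`) and on
TAILQ_REMOVE() not touching the head unless the element is last.

```c
#include <stddef.h>
#include <sys/queue.h>

struct buf {
	TAILQ_ENTRY(buf) b_freelist;
	int b_id;
};
TAILQ_HEAD(bqueue, buf);

/*
 * Remove bp from whichever of the n queues it is on.
 * Returns 1 on success, 0 if bp is a tail that no queue claims.
 */
static int
remove_from_any(struct bqueue *queues, size_t n, struct buf *bp)
{
	size_t i;

	if (TAILQ_NEXT(bp, b_freelist) != NULL) {
		/* Not the last element: the head is never consulted. */
		TAILQ_REMOVE(&queues[0], bp, b_freelist);
		return 1;
	}

	/* Last element: find the head whose tail pointer refers to bp. */
	for (i = 0; i < n; i++) {
		if (queues[i].tqh_last == &bp->b_freelist.tqe_next) {
			TAILQ_REMOVE(&queues[i], bp, b_freelist);
			return 1;
		}
	}
	return 0;
}
```

Only the tail case needs a search, and that search compares one pointer per
queue instead of walking every element.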


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require the bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the
kernel, and the LKM framework will refer to it to assign device major
numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.291 10-Apr-2020 ad

Remove buffer reference counting, now that it's safe to destroy b_busy after
waking any waiters.


Revision tags: bouyer-xenpvh-base phil-wifi-20200406
# 1.290 14-Mar-2020 ad

- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: is-mlppp-base ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
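
The overflow reported above can be reproduced outside the kernel. The sketch below is illustrative only: the helper names are hypothetical, and the second function widens to size_t before multiplying, which is a stricter variant than the u_int return type described in the commit.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: reports whether (n / 2) * 1024 would overflow
 * when evaluated in signed int arithmetic, as in the UBSan report.
 */
static int
halved_kib_overflows_int(int n)
{
	return (n / 2) > INT_MAX / 1024;
}

/*
 * Hypothetical helper: the same calculation starting from an unsigned
 * count and widened to size_t before the multiply (an assumption made
 * for this sketch, not the committed code).
 */
static size_t
bufcount_max(unsigned int nbuf)
{
	return (size_t)(nbuf / 2) * 1024;
}
```

With a buffer count of 6690276, half of it is the 3345138 seen in the error message, and the product no longer fits in a 32-bit int.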


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
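
The format-string rules above can be tried in userland. This is a minimal sketch using snprintf(3) rather than the kernhist(9) macros; format_hist() and its arguments are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a pointer and a counter the way the updated format strings
 * require: "%#jx" instead of "%p", "%ju" for numbers, and every
 * argument cast to uintmax_t (pointers go through uintptr_t first).
 */
static int
format_hist(char *buf, size_t len, const void *p, unsigned long count)
{
	return snprintf(buf, len, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)p, (uintmax_t)count);
}
```

Casting through uintptr_t first avoids the "pointer to integer of a different size" warning on platforms where pointers are narrower than uintmax_t.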


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
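
For readers unfamiliar with the idea of trading a division for a multiplication, the following is a generic sketch of one common form of it. It is an assumption for illustration, not the code from this revision: scale_to_range() is a hypothetical name, and the 32-bit input stands in for a value such as the one cprng_fast32() returns.

```c
#include <stdint.h>

/*
 * Map a 32-bit random value r into [0, n) with a widening multiply
 * and a shift, instead of computing r % n or r / (2^32 / n).
 */
static uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```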


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
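
The problem described above is that a u_long value does not always survive a round trip through an int temporary. A minimal userland check, with fits_in_int() as a hypothetical helper name:

```c
#include <limits.h>

/*
 * Returns nonzero when v can be stored in an int without loss.
 * Values above INT_MAX would be altered by an int temporary.
 */
static int
fits_in_int(unsigned long v)
{
	return v <= (unsigned long)INT_MAX;
}
```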


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid being zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
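
The accounting described in this message can be sketched with a simplified stand-in for the master buffer. The struct and function below are hypothetical and only model the three fields mentioned; they are not the kernel's struct buf or nestiobuf code.

```c
#include <stddef.h>

/* Simplified stand-in for the master buffer (not the real struct buf). */
struct master_sketch {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int    b_error;		/* first error seen from a nested buffer */
};

/*
 * Account for one finished nested buffer covering 'donebytes' bytes.
 * Returns 1 once all nested buffers have reported, i.e. the point at
 * which biodone(mbp) would be called.
 */
static int
nested_done(struct master_sketch *mbp, size_t donebytes, int error)
{
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= donebytes;
	if (mbp->b_resid != 0)
		return 0;
	/* On error, restore the residual so error and b_resid agree. */
	if (mbp->b_error != 0)
		mbp->b_resid = mbp->b_bcount;
	return 1;
}
```

Restoring b_resid to b_bcount is what keeps the "error implies residual" assertion in physio() satisfied.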


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still needs work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
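
A minimal sketch of the adjustment described above, assuming a helper
that reports how many bytes the cache may give back. The function name
and parameters are hypothetical; the real canrelease function takes
other factors into account.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: decide how many bytes to release.  When the
 * cache is above the high-water mark, release at least the excess so
 * that a single pass brings it back down to the mark.
 */
static size_t
bytes_to_release(size_t bufmem, size_t hiwater, size_t wanted)
{
	size_t excess = bufmem > hiwater ? bufmem - hiwater : 0;
	size_t n = wanted > excess ? wanted : excess;

	assert(hiwater > 0);
	/* Never claim more than the cache currently holds. */
	return n < bufmem ? n : bufmem;
}
```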


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite a normal condition and 2) there is
a pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.
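
The log does not show the offending expression. As a generic
illustration of this class of bug only, the sketch below scales a byte
count to the range 0..256 with a 32-bit and a 64-bit intermediate
product. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale `used` relative to `limit` using 32-bit arithmetic.  The
 * product used * 256 wraps around once used reaches 2^24 bytes.
 */
static uint32_t
scale32(uint32_t used, uint32_t limit)
{
	assert(limit != 0);
	return (uint32_t)(used * 256u) / limit;
}

/* Same computation with a 64-bit intermediate, which cannot wrap. */
static uint32_t
scale64(uint32_t used, uint32_t limit)
{
	assert(limit != 0);
	return (uint32_t)(((uint64_t)used * 256u) / limit);
}
```

With used = 0x01000000 and limit = 0x02000000, scale32() yields 0 while
scale64() yields 128, so a comparison based on the 32-bit result would
treat a half-full cache as empty.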

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.
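
One possible reading of that formula, with both sides converted to
bytes so that pages are never compared against bytes. The function
name and the 4 KB page size are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KB pages */

/*
 * Hypothetical sketch: release the smaller of the page deficit
 * (converted to bytes) and one sixteenth of the buffer memory.
 */
static size_t
release_target_bytes(size_t page_deficit, size_t bufmem_bytes)
{
	size_t want = page_deficit * SKETCH_PAGE_SIZE;	/* pages -> bytes */
	size_t cap = bufmem_bytes / 16;

	assert(SKETCH_PAGE_SIZE != 0);
	return want < cap ? want : cap;
}
```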

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
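
A minimal sketch of such a growth throttle, assuming the chance of a
new allocation falls linearly as the cache fills toward the high-water
mark. The function name is hypothetical, and the random value is passed
in as `roll` (0..255) so the decision is easy to test.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: return nonzero if the cache may grow.
 * `roll` is a caller-supplied random value; only its low 8 bits
 * are used.
 */
static int
may_grow(uint64_t bufmem, uint64_t hiwater, unsigned roll)
{
	unsigned level;

	assert(hiwater > 0);
	if (bufmem >= hiwater)
		return 0;		/* cheap test first */
	/* Fill level scaled to 0..255. */
	level = (unsigned)((bufmem * 256) / hiwater);
	return (roll & 0xffu) >= level;
}
```

An empty cache always grows, a cache at the high-water mark never
does, and a half-full cache grows for about half of the possible rolls.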

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
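
The idea can be sketched outside the kernel with <sys/queue.h>. This is
an illustration under assumptions, not the bremfree() code: struct sbuf,
the queues array and remove_from_any() are hypothetical names. A head is
only needed when the element is last on its list, and in that case it
can be found by comparing each head's tqh_last with the address of the
element's next pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct sbuf {
	int id;
	TAILQ_ENTRY(sbuf) link;
};
TAILQ_HEAD(sbufq, sbuf);

#define NQUEUES 3
static struct sbufq queues[NQUEUES];

static void
queues_init(void)
{
	for (int i = 0; i < NQUEUES; i++)
		TAILQ_INIT(&queues[i]);
}

/*
 * Remove bp from whichever queue it is on.  Only removal of the last
 * element has to update a head (its tqh_last), so the head is looked
 * up in that case alone.
 */
static void
remove_from_any(struct sbuf *bp)
{
	if (bp->link.tqe_next != NULL) {
		bp->link.tqe_next->link.tqe_prev = bp->link.tqe_prev;
	} else {
		struct sbufq *q = NULL;

		for (int i = 0; i < NQUEUES; i++) {
			if (queues[i].tqh_last == &bp->link.tqe_next) {
				q = &queues[i];
				break;
			}
		}
		assert(q != NULL);	/* bp must be on one of the queues */
		q->tqh_last = bp->link.tqe_prev;
	}
	*bp->link.tqe_prev = bp->link.tqe_next;
}
```

The cost is that the code depends on the field layout of the TAILQ
macros, which is exactly the abstraction break the comment warns about.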


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent
majors.<arch> using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the device major number list is packed into the kernel and
the LKM framework will refer to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
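
Note that the source and destination arguments swap places in the
bcopy-to-memcpy conversion. A small sketch of the converted call
style follows; the helper names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Copy len bytes from src to dst; the buffers must not overlap. */
static void
copy_name(char *dst, const char *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);		/* was: bcopy(src, dst, len) */
}

/* Zero len bytes at dst. */
static void
clear_name(char *dst, size_t len)
{
	assert(dst != NULL);
	memset(dst, 0, len);		/* was: bzero(dst, len) */
}
```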


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
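
The pattern described above can be sketched as follows. The types and
names (struct sproc, scurproc, sproc0) are stand-ins invented for this
example, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the per-process resource usage counters. */
struct sproc {
	long ru_inblock;
	long ru_oublock;
};

static struct sproc sproc0;		/* stand-in for process 0 */
static struct sproc *scurproc;		/* may be NULL, e.g. during a panic */

/* Pick the process to charge: the current one, else process 0. */
static struct sproc *
accounting_proc(void)
{
	return scurproc != NULL ? scurproc : &sproc0;
}

/* Charge one block read to the chosen process. */
static void
count_read(void)
{
	struct sproc *p = accounting_proc();

	assert(p != NULL);
	p->ru_inblock++;
}
```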


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.290 14-Mar-2020 ad

- Hide the details of SPCF_SHOULDYIELD and related behind a couple of small
functions: preempt_point() and preempt_needed().

- preempt(): if the LWP has exceeded its timeslice in kernel, strip it of
any priority boost gained earlier from blocking.


Revision tags: ad-namecache-base3
# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
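
The overflow quoted above happens because the multiplication is carried out in int before the result is widened. A minimal sketch of the arithmetic, assuming a hypothetical helper toy_bufcount_max() that is not part of the tree: taking the count as unsigned makes wraparound well defined, and widening to size_t before multiplying avoids depending on it at all.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the wl_bufcount_max computation quoted in
 * the commit message; the name and signature are illustrative only.
 */
static size_t
toy_bufcount_max(unsigned int nbuf)
{
	assert(nbuf > 0);
	/* Widen first, so the product is computed in size_t, not int. */
	return ((size_t)nbuf / 2) * 1024;
}
```

With a buffer count whose half is 3345138 (the operand in the UBSan message), the product is 3425421312, which does not fit in a 32-bit int but does fit in a 32-bit unsigned value or a wider size_t.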


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
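
The format-string rules listed above can be sketched in plain userland C, assuming a hypothetical helper toy_format_hist() (the kernhist macros themselves are not used here): numeric arguments are passed as uintmax_t and printed with the 'j' length modifier, and pointers are cast through uintptr_t and printed with "%#jx" instead of "%p".

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: formats one pointer and one counter the way the
 * commit message describes, using standard snprintf(3).
 */
static int
toy_format_hist(char *out, size_t outlen, const void *ptr,
    unsigned long count)
{
	assert(out != NULL && outlen > 0);
	/* Pointer: (uintptr_t) first, then (uintmax_t), printed as %#jx. */
	/* Counter: widened to uintmax_t, printed as %ju. */
	return snprintf(out, outlen, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)ptr, (uintmax_t)count);
}
```

Because every argument has the same width, a consumer that only sees the format string and an array of uintmax_t values can decode the record without knowing the original argument types.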


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock = &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
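
The commit message does not show the computation, so the following is only a generic illustration of trading a division for a multiplication, not the code in vfs_bio.c. For unsigned operands with y > 0, the test x / y >= c is equivalent to x >= c * y, provided the product does not overflow; the helper names are hypothetical.

```c
#include <assert.h>

/* Threshold test written with a division. */
static int
toy_above_div(unsigned long x, unsigned long y, unsigned long c)
{
	assert(y > 0);
	return x / y >= c;
}

/*
 * Same test written with a multiplication. The caller must ensure
 * that c * y cannot overflow unsigned long.
 */
static int
toy_above_mul(unsigned long x, unsigned long y, unsigned long c)
{
	assert(y > 0);
	return x >= c * y;
}
```

The equivalence holds because c is an integer: floor(x / y) >= c exactly when x / y >= c, which is x >= c * y.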


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error path via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace a few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi Systems' WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.
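
A rough sketch of why the index stopped matching kilobytes, assuming
power-of-two size classes that start at 512 bytes. The function name and
the shift constant are assumptions made for illustration, not taken from
the source.

```c
#include <assert.h>

#define SKETCH_INDEX_OFFSET 9	/* assumed: log2 of the smallest class, 512 */

/*
 * Map a buffer size in bytes to a pool index: 512 -> 0, 1024 -> 1,
 * 2048 -> 2, and so on. Sizes between classes round up to the next one.
 */
unsigned
sketch_pool_index(unsigned long size)
{
	unsigned n = 0;

	assert(size > 0);
	size -= 1;
	size >>= SKETCH_INDEX_OFFSET;
	while (size != 0) {
		size >>= 1;
		n++;
	}
	return n;
}
```

Index 0 now holds 512-byte buffers, so a name derived from the index as
if it counted kilobytes would be off by one class.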


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
of the nested iobuf extent was transferred.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
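
The adjustment can be pictured with the sketch below. It is a simplified
model under stated assumptions: all quantities are in bytes, the function
and parameter names are hypothetical, and the behaviour between the two
marks is a guess for illustration rather than what canrelease does.

```c
#include <stddef.h>

/*
 * Hypothetical helper: how many bytes the cache should give back now.
 * `wanted` is what the caller would like to reclaim.
 */
size_t
sketch_release_target(size_t bufmem, size_t lowater, size_t hiwater,
    size_t wanted)
{
	size_t room;

	if (bufmem < lowater)
		return 0;			/* never shrink below the floor */
	if (bufmem > hiwater)
		return bufmem - hiwater;	/* drop back to hiwater at once */

	/* Between the marks (assumed): honour the request, stay above lowater. */
	room = bufmem - lowater;
	return wanted < room ? wanted : room;
}
```

The second test is the point of the change: once usage exceeds the
high-water mark, the excess is reported regardless of how little was asked
for.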


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to decide whether to allocate doesn't
make sense, since 1) the condition is quite normal and 2) there is a
pool between us and uvm.
- Smooth the step in allocation probability by moving the origin of the
curve from 0 to the lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
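
One plausible shape of such an overflow, shown for illustration only. The
original expression is not quoted in this log, so both functions are
hypothetical, and `guess` stands in for a random value assumed to lie in
0..15.

```c
#include <stdint.h>

/* Overflow-prone: bufmem * 16 wraps once bufmem reaches 256 MiB. */
int
sketch_lotsfree_wrapping(uint32_t bufmem, uint32_t hiwater, uint32_t guess)
{
	uint32_t thresh = (uint32_t)(bufmem * 16u) / hiwater;

	return guess >= thresh;
}

/*
 * Safer: reject early when above the high water mark, then divide
 * before scaling so the product never exceeds hiwater.
 */
int
sketch_lotsfree_checked(uint32_t bufmem, uint32_t hiwater, uint32_t guess)
{
	if (bufmem > hiwater)
		return 0;	/* no random number needed in this case */
	return (hiwater / 16) * guess >= bufmem;
}
```

With bufmem at 384 MiB and hiwater at 256 MiB, the wrapping form computes
a threshold of 8 and says yes for any guess of 8 or more, while the
checked form says no.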


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
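
The unit mix-up and the two adjustments can be sketched as below. This is
a hedged illustration: the function name, the fixed 4096-byte page size,
and the choice to convert everything to bytes are assumptions, not the
committed code.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for this example */

/*
 * Estimate how many bytes to release. The page deficit arrives in
 * pages and is converted to bytes before being compared with anything
 * measured in bytes. The result is capped at bufmem / 16 and is never
 * less than the memory sitting on the AGE list.
 */
size_t
sketch_release_estimate(long page_deficit, size_t bufmem, size_t age_bytes)
{
	size_t want = 0;

	if (page_deficit > 0) {
		size_t deficit_bytes = (size_t)page_deficit * SKETCH_PAGE_SIZE;
		size_t cap = bufmem / 16;

		want = deficit_bytes < cap ? deficit_bytes : cap;
	}
	return want > age_bytes ? want : age_bytes;
}
```

Returning at least age_bytes is what makes the AGE list behave like a
delayed free list in this model.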


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
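
The trick can be sketched with a hand-rolled tail queue. The types and
functions here are hypothetical and only mimic the layout idea behind
TAILQ; they are not the <sys/queue.h> macros or the bremfree code.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_node {
	struct sketch_node *next;
	struct sketch_node **prevp;	/* address of the pointer to us */
};

struct sketch_queue {
	struct sketch_node *first;
	struct sketch_node **lastp;	/* address of the final next pointer */
};

void
sketch_queue_init(struct sketch_queue *q)
{
	q->first = NULL;
	q->lastp = &q->first;
}

void
sketch_insert_tail(struct sketch_queue *q, struct sketch_node *n)
{
	n->next = NULL;
	n->prevp = q->lastp;
	*q->lastp = n;
	q->lastp = &n->next;
}

/*
 * Remove n without being told which of the nq queues holds it. Only a
 * tail element needs its queue head, and that head is found by looking
 * for the queue whose lastp points at n's next field.
 */
void
sketch_remove(struct sketch_queue *qs, size_t nq, struct sketch_node *n)
{
	if (n->next != NULL) {
		n->next->prevp = n->prevp;
	} else {
		struct sketch_queue *q = NULL;

		for (size_t i = 0; i < nq; i++)
			if (qs[i].lastp == &n->next)
				q = &qs[i];
		assert(q != NULL);
		q->lastp = n->prevp;
	}
	*n->prevp = n->next;
}
```

Non-tail removals never touch a queue head, which is why the scan over the
queue heads is only needed in the tail case.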


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- Backward compatibility for loading block/character device
switches through the LKM framework is broken. This is necessary to convert
from block/character device major to device name at runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel
and the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
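
The mapping above swaps the first two arguments for the copy routines, which is easy to get wrong by hand. A minimal sketch of what converted call sites look like, assuming hypothetical helper names (copy_block, slide_block, clear_block) that do not appear in vfs_bio.c:

```c
#include <string.h>

/*
 * Hypothetical helpers, for illustration only.
 * bcopy(src, dst, len) took the source first; memcpy(dst, src, len)
 * takes the destination first, so the first two arguments swap.
 */
static void
copy_block(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);		/* was: bcopy(src, dst, len) */
}

/* Overlapping regions need memmove, the replacement for ovbcopy. */
static void
slide_block(void *dst, const void *src, size_t len)
{
	memmove(dst, src, len);		/* was: ovbcopy(src, dst, len) */
}

/* bzero had no fill argument; memset takes the fill byte second. */
static void
clear_block(void *buf, size_t len)
{
	memset(buf, 0, len);		/* was: bzero(buf, len) */
}
```

bcmp keeps its argument order under memcmp, so only the name changes there.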


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
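
The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the committed code: proc_sketch, curproc_sketch and proc0_sketch are hypothetical stand-ins for the kernel's struct proc, curproc and process 0.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types and globals. */
struct proc_sketch {
	long ru_inblock;	/* block input operations charged */
};

static struct proc_sketch proc0_sketch;		/* plays the role of process 0 */
static struct proc_sketch *curproc_sketch;	/* may be NULL, e.g. during a panic */

/* Pick the process to charge: curproc if set, otherwise process 0. */
static struct proc_sketch *
accounting_proc(void)
{
	return curproc_sketch != NULL ? curproc_sketch : &proc0_sketch;
}
```

Callers then update the counters through the returned pointer and never dereference the possibly-NULL global directly.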


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.289 21-Feb-2020 riastradh

OOPS -- fix mistake in previous commit.

bbusy really needs to return the error; otherwise things are very
bad!


# 1.288 20-Feb-2020 riastradh

Buffer cache SDT probes.


Revision tags: ad-namecache-base2 ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
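
To see why the signed product overflowed, note that 3345138 * 1024 = 3425421312, which exceeds INT_MAX (2147483647) for a 32-bit int. A minimal sketch of an overflow-free form follows; bufcount_max is a hypothetical helper, and widening to size_t before the multiply is an assumption made here for illustration, not necessarily what the commit does (the commit only changes the return type of buf_nbuf() to u_int).

```c
#include <stddef.h>

/*
 * Hypothetical helper mirroring "(buf_nbuf() / 2) * 1024".
 * The count is unsigned and is widened to size_t before the
 * multiply, so no signed int arithmetic is involved.
 */
static size_t
bufcount_max(unsigned int nbuf)
{
	return ((size_t)nbuf / 2) * 1024;
}
```

With a u_int count and no widening, the multiply would still be done in 32 bits and could wrap for very large buffer counts, which may be what "Need more work?" alludes to.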


Revision tags: netbsd-9-0-RELEASE netbsd-9-0-RC2 netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
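
The format-string rules listed above can be sketched in userland with snprintf(3). This is an illustration only: format_addr and format_count are hypothetical helpers, not part of kernhist(9), and the sketch shows just the 'j' length modifier and the double cast for pointers.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a pointer the way the updated history calls do: "%#jx",
 * with the pointer cast to uintptr_t first and then to uintmax_t.
 */
static int
format_addr(char *out, size_t outlen, const void *p)
{
	return snprintf(out, outlen, "%#jx", (uintmax_t)(uintptr_t)p);
}

/* Format a numeric argument with the 'j' length modifier. */
static int
format_count(char *out, size_t outlen, unsigned long long n)
{
	return snprintf(out, outlen, "%ju", (uintmax_t)n);
}
```

Casting every argument to uintmax_t gives each one the same width, so the formatter never has to guess an argument's size from a non-literal format string.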


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
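
One common way to trade a division for a multiplication when scaling a 32-bit random value is a widening multiply followed by a shift. The sketch below shows that general technique only; scale32 is a hypothetical helper, the random value is passed in rather than drawn from cprng_fast32(), and the expression actually committed may differ.

```c
#include <stdint.h>

/*
 * Map a uniformly distributed 32-bit value r onto [0, n) without
 * a division or modulo: compute r * n in 64 bits and keep the
 * high 32 bits of the product.
 */
static uint32_t
scale32(uint32_t r, uint32_t n)
{
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The result is always strictly less than n because r is strictly less than 2^32.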


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read-error does occur in one of the nested buffers, the mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
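
The completion rule described in this entry can be sketched as follows. This is a simplified model, not the kernel code: master_sketch and nest_done are hypothetical, the done flag stands in for the call to biodone(mbp), and locking is omitted.

```c
#include <stddef.h>

/* Hypothetical, cut-down model of the master buffer's bookkeeping. */
struct master_sketch {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int    b_error;		/* first error seen, 0 if none */
	int    done;		/* stands in for biodone(mbp) */
};

/* Called once per nested buffer as it completes. */
static void
nest_done(struct master_sketch *mbp, size_t donebytes, int error)
{
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* On error, report the whole transfer as residual. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

With the reset in place, a consumer that asserts "error implies nonzero residual" sees a consistent buffer.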


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
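
The difference between the two behaviours described above can be sketched
as follows. This is only an illustration: the struct and function names are
hypothetical and do not appear in vfs_bio.c, and the "old" variant is an
assumption about how a later nested completion could clear the error.

```c
#include <errno.h>

/* Hypothetical stand-in for the master buffer; only the error field matters. */
struct master_sketch {
	int error;
};

/* Assumed old behaviour: every nested completion overwrites the master error. */
static void
nested_done_overwrite(struct master_sketch *m, int nested_error)
{
	m->error = nested_error;
}

/* Behaviour described in 1.195: an error, once recorded, is never cleared. */
static void
nested_done_sticky(struct master_sketch *m, int nested_error)
{
	if (nested_error != 0)
		m->error = nested_error;
}

/* Complete n nested buffers in order and return the master's final error. */
static int
run_sketch(void (*done)(struct master_sketch *, int), const int *errs, int n)
{
	struct master_sketch m = { 0 };
	int i;

	for (i = 0; i < n; i++)
		done(&m, errs[i]);
	return m.error;
}
```

With the completion order { EIO, 0 }, the overwriting variant ends with no
error while the sticky variant still reports EIO.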


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
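
The growth scenario and the adjustment can be sketched with a little
arithmetic. The function names and the `normal` parameter are hypothetical
and only model the description above; they are not the code in vfs_bio.c.

```c
#include <stdint.h>

/*
 * Recycling a buffer of old_size bytes as one of new_size bytes changes
 * the total buffer memory without any new allocation decision being made.
 */
static uint64_t
recycle_sketch(uint64_t bufmem, uint64_t old_size, uint64_t new_size)
{
	return bufmem - old_size + new_size;
}

/*
 * Bytes that may be released: whatever the usual policy allows ("normal"),
 * but never less than the amount by which we exceed the high-water mark.
 */
static uint64_t
canrelease_hiwater_sketch(uint64_t bufmem, uint64_t hiwater, uint64_t normal)
{
	uint64_t excess = (bufmem > hiwater) ? bufmem - hiwater : 0;

	return (excess > normal) ? excess : normal;
}
```

For example, a cache sitting exactly at a high-water mark of 1000 units that
recycles a 1-unit buffer as a 16-unit buffer ends up at 1015, and the
adjusted function then offers at least the 15 units of excess for release.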


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon told that this helps for interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.
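
The log does not show the expression that overflowed, so the following is
only an assumed illustration of this class of bug: scaling a byte count in
32-bit arithmetic wraps for large caches, making the cache look emptier than
it is. The function names are hypothetical.

```c
#include <stdint.h>

/* Fill level scaled by 16, computed in 32 bits: the multiply can wrap. */
static uint32_t
fill_level_32_sketch(uint32_t bufmem, uint32_t hiwater)
{
	return (bufmem * 16u) / hiwater;
}

/* Same computation widened to 64 bits before multiplying. */
static uint32_t
fill_level_64_sketch(uint32_t bufmem, uint32_t hiwater)
{
	return (uint32_t)(((uint64_t)bufmem * 16u) / hiwater);
}
```

With 256MB of buffer memory and a 128MB high-water mark, the 32-bit version
wraps to a fill level of 0 while the widened version reports 32, i.e. twice
the high-water mark.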

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.
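
A minimal sketch of that calculation, with the units made explicit. The
function name and parameters are hypothetical; the `age_bytes` floor
anticipates the AGE-list adjustment described in this same commit message,
and how the real code combines the terms is not shown in this log.

```c
#include <stdint.h>

/*
 * How many bytes of buffer memory to offer for release.
 * page_deficit is in pages (as reported by the VM system), so it is
 * converted to bytes before being compared with byte counts.
 */
static uint64_t
release_target_sketch(uint64_t page_deficit, uint64_t page_size,
    uint64_t bufmem, uint64_t age_bytes)
{
	uint64_t want = page_deficit * page_size;	/* pages -> bytes */
	uint64_t cap = bufmem / 16;			/* be conservative */

	if (want > cap)
		want = cap;
	if (want < age_bytes)		/* always give back the AGE list */
		want = age_bytes;
	return want;
}
```

With 4096-byte pages and 1MB of buffer memory, a 10-page deficit asks for
40960 bytes, while a 100-page deficit is capped at 65536 bytes.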

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
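
The growth control can be sketched as a randomized admission test. This is a
hedged illustration, not the code in vfs_bio.c: the function name, the 0..15
scale and the caller-supplied random value are assumptions, and the sketch
measures the fill level from the low-water mark, which a later revision
(1.136) describes.

```c
#include <stdint.h>

/*
 * Decide whether a new buffer allocation should be allowed.
 * rnd is a caller-supplied random number; only its low 4 bits are used.
 * Returns 1 to allow growth, 0 to force recycling of an existing buffer.
 */
static int
lotsfree_sketch(uint64_t bufmem, uint64_t lowater, uint64_t hiwater,
    unsigned rnd)
{
	unsigned level;

	if (bufmem < lowater)		/* small cache: always grow */
		return 1;
	if (bufmem >= hiwater)		/* full cache: never grow */
		return 0;

	/* Fill level between the two marks, scaled to 0..15. */
	level = (unsigned)(((bufmem - lowater) * 16) / (hiwater - lowater));

	/* The fuller the cache, the fewer random values pass the test. */
	return (rnd & 15) >= level;
}
```

Halfway between the marks the level is 8, so 8 of the 16 possible random
values allow growth; just below the high-water mark only 1 of 16 does.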

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() re-sizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.
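
The structure described in the list above might look roughly like the following. This is a hedged sketch with hypothetical names (demo_allocator, demo_pool); it models only the shared backend, the per-allocator pool list, and a reclaim routine that reports whether it freed anything.

```c
#include <assert.h>
#include <stddef.h>

struct demo_pool;

/* Backend allocator state shared by every pool that uses it. */
struct demo_allocator {
	struct demo_pool *da_pools;	/* list of pools on this backend */
};

struct demo_pool {
	struct demo_allocator *dp_alloc;	/* reference, not copied fields */
	struct demo_pool *dp_next;		/* link on da_pools */
	size_t dp_idle_pages;			/* pages that could be given back */
};

void
demo_pool_init(struct demo_pool *pp, struct demo_allocator *da, size_t idle)
{
	assert(pp != NULL && da != NULL);
	pp->dp_alloc = da;
	pp->dp_idle_pages = idle;
	pp->dp_next = da->da_pools;
	da->da_pools = pp;
}

/* Reclaim idle pages; returns nonzero if anything was actually freed. */
int
demo_pool_reclaim(struct demo_pool *pp)
{
	int freed = pp->dp_idle_pages != 0;

	pp->dp_idle_pages = 0;
	return freed;
}

/* On KVA shortage, drain every pool that shares the backend. */
int
demo_allocator_drain(struct demo_allocator *da)
{
	struct demo_pool *pp;
	int freed = 0;

	for (pp = da->da_pools; pp != NULL; pp = pp->dp_next)
		freed |= demo_pool_reclaim(pp);
	return freed;
}
```

The return value of the drain lets the caller decide whether retrying the KVA allocation is worthwhile.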


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).
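
For readers unfamiliar with the idiom, a page-rounding helper can be sketched as below. The 4096-byte page size is an assumption for illustration only (the real value is per-port), and demo_round_page is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE	((uintptr_t)4096)	/* assumed; real value is per-port */
#define DEMO_PAGE_MASK	(DEMO_PAGE_SIZE - 1)

/* Round x up to the next page boundary (page size must be a power of two). */
uintptr_t
demo_round_page(uintptr_t x)
{
	assert((DEMO_PAGE_SIZE & DEMO_PAGE_MASK) == 0);
	return (x + DEMO_PAGE_MASK) & ~DEMO_PAGE_MASK;
}
```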


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.
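
A small sketch of why the absolute block number matters: two requests on different partitions only sort correctly once the partition offset is added in. The demo_req type and the offsets are hypothetical; only the idea of a separate raw block number comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_req {
	int64_t dr_blkno;	/* block number relative to the partition */
	int64_t dr_rawblkno;	/* block number relative to the whole disk */
};

/* Fill in the absolute block number from the partition's start offset. */
void
demo_set_rawblkno(struct demo_req *rq, int64_t part_offset)
{
	assert(rq != NULL);
	rq->dr_rawblkno = part_offset + rq->dr_blkno;
}

/* qsort(3) comparator: order requests by position on the spindle. */
int
demo_cmp_raw(const void *a, const void *b)
{
	const struct demo_req *ra = a, *rb = b;

	return (ra->dr_rawblkno > rb->dr_rawblkno) -
	    (ra->dr_rawblkno < rb->dr_rawblkno);
}
```

Sorting by the partition-relative number alone would place block 10 of a later partition ahead of block 500 of the first one.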


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
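
The argument-order change in the table above is easy to get wrong, so here is a short sketch of the conversions. The demo_* wrapper names are hypothetical; the standard memcpy(), memmove(), and memset() calls are the only real interfaces used.

```c
#include <assert.h>
#include <string.h>

/* bcopy() takes the source first; memcpy() takes the destination first. */
void
demo_copy(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	memcpy(dst, src, len);		/* was: bcopy(src, dst, len) */
}

/* Overlapping regions need memmove(), the replacement for ovbcopy(). */
void
demo_copy_overlap(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);		/* was: ovbcopy(src, dst, len) */
}

/* bzero(p, len) becomes memset(p, 0, len): note the extra fill argument. */
void
demo_zero(void *p, size_t len)
{
	memset(p, 0, len);		/* was: bzero(p, len) */
}
```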


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
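
The fallback described above can be modelled in a few lines. This is a userland sketch with hypothetical names (demo_proc, demo_curproc, demo_proc0), not the kernel's accounting code.

```c
#include <assert.h>
#include <stddef.h>

struct demo_proc {
	long dp_inblock;	/* block input operations charged */
};

struct demo_proc demo_proc0;		/* stand-in for process 0 */
struct demo_proc *demo_curproc;		/* may be NULL, e.g. during a panic */

/* Charge one block read to the current process, or to proc0 if there is none. */
void
demo_account_read(void)
{
	struct demo_proc *p =
	    (demo_curproc != NULL) ? demo_curproc : &demo_proc0;

	assert(p != NULL);
	p->dp_inblock++;
}
```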


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


Revision tags: ad-namecache-base1
# 1.287 17-Jan-2020 ad

biodone2(): don't acquire kernel_lock for anybody anymore.


Revision tags: ad-namecache-base
# 1.286 31-Dec-2019 ad

branches: 1.286.2;
Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
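
To make the overflow concrete: 3345138 * 1024 is 3425421312, which exceeds the largest 32-bit signed int (2147483647) but fits in an unsigned 32-bit value. A minimal sketch of the computation with an unsigned count, widened to size_t before multiplying, is shown here; demo_bufcount_max is a hypothetical helper, not the wapbl code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size the limit from an unsigned buffer count. */
size_t
demo_bufcount_max(unsigned int nbuf)
{
	size_t max = (size_t)(nbuf / 2) * 1024;	/* widen before multiplying */

	assert(max / 1024 == nbuf / 2);		/* no wraparound */
	return max;
}
```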


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
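
The pointer-to-history-argument conversion described above can be tried out in userland with plain snprintf(3). This is only an illustration of the "%#jx" format and the double cast; the demo_hist_* names are hypothetical and are not part of kernhist(9).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Convert a pointer to a history argument without a size-mismatch warning. */
uintmax_t
demo_hist_ptr(const void *p)
{
	return (uintmax_t)(uintptr_t)p;
}

/* Format one argument the way a "%#jx" history format string would. */
int
demo_hist_format(char *buf, size_t len, uintmax_t arg)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%#jx", arg);
}
```

Because every argument has the same type, a non-literal format string can no longer misread one wide value as several narrow ones.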


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
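
One common way to trade a division for a wide multiplication is the multiply-and-shift range reduction sketched below. Whether this is the exact form used in the commit is an assumption; demo_scale32 is a hypothetical name, and the 32-bit input stands in for a value such as the one cprng_fast32() returns.

```c
#include <assert.h>
#include <stdint.h>

/* Map a uniformly random 32-bit value r into [0, n) without dividing. */
uint32_t
demo_scale32(uint32_t r, uint32_t n)
{
	uint32_t v = (uint32_t)(((uint64_t)r * n) >> 32);

	assert(n == 0 || v < n);
	return v;
}
```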


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
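
The fix amounts to validating the new value in a temporary of the same width as the variable. A minimal sketch, with hypothetical names and a made-up limit parameter, could look like this:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Validate a requested water mark in an unsigned long temporary, so no
 * bits are lost on LP64 systems, and commit it only if it is in range.
 */
int
demo_set_watermark(unsigned long *var, unsigned long requested,
    unsigned long max)
{
	unsigned long tmp = requested;	/* same width as *var, not int */

	assert(var != NULL);
	if (tmp > max)
		return -1;		/* reject; *var is left untouched */
	*var = tmp;
	return 0;
}
```

Using an int temporary here would silently truncate any value that needs more than 32 bits.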


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace a few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics, since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
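
The accounting described in this entry can be modelled as follows. This is a simplified userland sketch: demo_mbuf and demo_nest_done are hypothetical names, locking is omitted, and a flag stands in for the call to biodone(mbp).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the master buffer of a nested i/o. */
struct demo_mbuf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int b_error;		/* error from any nested buffer, or 0 */
	int done;		/* set where biodone(mbp) would be called */
};

/* Called once per nested buffer when its part of the transfer finishes. */
void
demo_nest_done(struct demo_mbuf *mbp, size_t donebytes, int error)
{
	assert(mbp->b_resid >= donebytes);
	mbp->b_resid -= donebytes;
	if (error != 0)
		mbp->b_error = error;
	if (mbp->b_resid == 0) {
		/* All children reported: on error, nothing counts as transferred. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

After the last child reports, a failed master buffer carries both an error and a full residual, which is the invariant physio() asserts.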


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.
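
A rough sketch of how a request size might be mapped onto a set of
power-of-two buffer pools once a 512-byte pool exists. The names
(`sketch_pool_index`, `sketch_pool_size`) and the 64 KiB upper bound are
assumptions made for illustration; they are not taken from vfs_bio.c.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MINSHIFT	9	/* smallest pool: 512 bytes */
#define SKETCH_MAXSHIFT	16	/* largest pool: 64 KiB (assumed) */
#define SKETCH_NPOOLS	(SKETCH_MAXSHIFT - SKETCH_MINSHIFT + 1)

/*
 * Return the index of the smallest power-of-two pool that can hold
 * `size` bytes, or -1 if the request is larger than the biggest pool.
 */
int
sketch_pool_index(size_t size)
{
	size_t poolsize = (size_t)1 << SKETCH_MINSHIFT;
	int i;

	for (i = 0; i < SKETCH_NPOOLS; i++, poolsize <<= 1) {
		if (size <= poolsize)
			return i;
	}
	return -1;
}

/* Return the item size, in bytes, of the pool at `index`. */
size_t
sketch_pool_size(int index)
{
	assert(index >= 0 && index < SKETCH_NPOOLS);
	return (size_t)1 << (index + SKETCH_MINSHIFT);
}
```

In this layout a 512-byte fragment lands in pool 0 instead of occupying a
1 KiB item, and the pool index is no longer the item size in kilobytes.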


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
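
The adjustment can be pictured with a small sketch. The function name and
parameters are hypothetical and the real canrelease function weighs more
inputs. The only point shown is that any excess over the high-water mark is
treated as a minimum amount to release. Never shrinking below the low-water
mark is an assumption of the sketch.

```c
#include <assert.h>

/*
 * Decide how many bytes of buffer memory may be released.
 * `wanted` is what the caller asked for; all values are in bytes.
 */
unsigned long
sketch_canrelease(unsigned long bufmem, unsigned long lowater,
    unsigned long hiwater, unsigned long wanted)
{
	unsigned long floor_release, max_release;

	assert(lowater < hiwater);

	if (bufmem <= lowater)
		return 0;		/* nothing to give back */

	/* Assumption: never shrink the cache below the low-water mark. */
	max_release = bufmem - lowater;

	/* Above the high-water mark: release at least the excess. */
	floor_release = bufmem > hiwater ? bufmem - hiwater : 0;

	if (wanted < floor_release)
		wanted = floor_release;
	if (wanted > max_release)
		wanted = max_release;
	return wanted;
}
```

With bufmem at 120 and a high-water mark of 100, even a request for 0 bytes
yields 20, so a cache that grew through recycling is pulled back immediately.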


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon told that this helps for interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
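
The unit mix-up and the two adjustments described in this commit can be
sketched as follows. The function and parameter names are hypothetical; the
sketch assumes the page deficit is converted to bytes before being compared
with buffer memory, which is one way to keep the units consistent and may not
match the committed code.

```c
#include <assert.h>

/*
 * Compute how many bytes of buffer memory to offer back.
 *   page_deficit: pages the VM system is short of (pages, not bytes)
 *   page_size:    bytes per page
 *   bufmem:       bytes currently held by the buffer cache
 *   age_bytes:    bytes held by buffers on the AGE list
 */
unsigned long
sketch_release_target(unsigned long page_deficit, unsigned long page_size,
    unsigned long bufmem, unsigned long age_bytes)
{
	unsigned long deficit_bytes, cap, target;

	assert(page_size > 0);
	assert(age_bytes <= bufmem);

	/* Convert to bytes first so both sides use the same unit. */
	deficit_bytes = page_deficit * page_size;

	/* min(page deficit, buffer memory / 16) */
	cap = bufmem / 16;
	target = deficit_bytes < cap ? deficit_bytes : cap;

	/* Always give back at least what sits on the AGE list. */
	if (target < age_bytes)
		target = age_bytes;
	return target;
}
```

A small deficit is honoured in full, a large one is capped at one sixteenth
of the cache, and dead metadata on the AGE list is always counted.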


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
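
The growth control described above can be sketched as a function that decides
whether a new buffer may be allocated. Everything here is an assumption made
for illustration: the name, the 16-step linear ramp between the two water
marks, and passing the random value in as a parameter so the sketch stays
deterministic. It is not a copy of buf_lotsfree().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if the cache may grow by allocating new memory, 0 if the
 * caller should recycle an existing buffer instead.  `rnd` is any
 * random number; only its low 4 bits are used.
 */
int
sketch_lotsfree(uint64_t bufmem, uint64_t lowater, uint64_t hiwater,
    uint32_t rnd)
{
	uint64_t fill;

	assert(lowater < hiwater);

	/* Cheap tests first: no randomness needed at either extreme. */
	if (bufmem < lowater)
		return 1;
	if (bufmem >= hiwater)
		return 0;

	/*
	 * How full the cache is between the water marks, scaled to
	 * 0..15.  64-bit arithmetic keeps the multiplication from
	 * wrapping on machines with a lot of buffer memory.
	 */
	fill = ((bufmem - lowater) * 16) / (hiwater - lowater);

	/* The fuller the cache, the fewer random draws allow growth. */
	return (rnd & 15) >= fill;
}
```

Just above the low-water mark nearly every draw allows growth; close to the
high-water mark only about one draw in sixteen does.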


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
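
The mapping above swaps the first two arguments, which is the easiest part to get wrong when converting by hand. A minimal userland sketch of the two most common conversions follows; the helper names `copy_block` and `zero_block` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: copy len bytes from src to dst.
 * Old style: bcopy(src, dst, len);	source first
 * New style: memcpy(dst, src, len);	destination first
 */
static void
copy_block(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);
}

/*
 * Hypothetical helper: clear len bytes at p.
 * Old style: bzero(p, len);
 * New style: memset(p, 0, len);
 */
static void
zero_block(void *p, size_t len)
{
	assert(p != NULL);
	memset(p, 0, len);
}
```

Note that memcpy(), unlike bcopy(), does not promise correct results for overlapping regions, which is why ovbcopy() maps to memmove() instead.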


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.286 31-Dec-2019 ad

Rename uvm_free() -> uvm_availmem().


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
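
The overflow in the quoted expression can be reproduced outside the kernel. The sketch below is a userland illustration only: `would_overflow_int` and `bufcount_max` are hypothetical names, and widening to size_t before the multiply is an assumption about one possible follow-up, not what this commit did (the commit changed the return type of buf_nbuf() to u_int).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * With a signed int count, (nbuf / 2) * 1024 is evaluated in int and
 * overflows once nbuf / 2 exceeds INT_MAX / 1024.  Report that case
 * without actually performing the undefined multiplication.
 */
static int
would_overflow_int(int nbuf)
{
	assert(nbuf >= 0);
	return (nbuf / 2) > INT_MAX / 1024;
}

/* Assumed alternative: widen first so the product is computed in size_t. */
static size_t
bufcount_max(unsigned int nbuf)
{
	return (size_t)(nbuf / 2) * 1024;
}
```

The value from the UBSan report, 3345138, corresponds to a buffer count of 6690276, which is well past the point where the int product stops being representable.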


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
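
The format-string rules listed above can be tried in userland with snprintf(3). This is a sketch under stated assumptions: `format_record` is a hypothetical helper, and the real kernhist(9) macros record the arguments instead of formatting them immediately. It only shows the 'j' length modifier and the double cast for pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a pointer and a count the way the commit
 * describes.  Every argument is passed as a uintmax_t and every
 * conversion carries the 'j' length modifier.  The pointer goes through
 * uintptr_t first to avoid a "pointer to integer of a different size"
 * warning.
 */
static int
format_record(char *out, size_t outlen, const void *ptr, unsigned count)
{
	assert(out != NULL && outlen > 0);
	return snprintf(out, outlen, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)ptr, (uintmax_t)count);
}
```

Because every argument has the same width, a consumer such as vmstat(1) can decode a record without knowing the original types.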


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already sets bp->b_objlock = &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
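
The commit message does not show the expression it changed. A common way to replace a 32-bit division or modulo with a wider multiplication is sketched below; `scale_to_range` is a hypothetical name, and in the kernel the value `r` would come from cprng_fast32() instead of being passed in.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a uniformly distributed 32-bit value r onto [0, n) without a
 * division: take the high 32 bits of the 64-bit product r * n.
 */
static uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The result is always strictly less than n because r is strictly less than 2^32.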


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and, on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
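
The accounting described above can be sketched with a reduced model. This is an illustration under stated assumptions, not the nestiobuf code: struct demo_master and both functions are hypothetical, only the field names b_bcount, b_resid and b_error follow the log, and every nested buffer is assumed to report its full length back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced model of a master buffer. */
struct demo_master {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int b_error;		/* first error seen, 0 if none */
	int done;		/* set once biodone() would be called */
};

/* One nested buffer covering 'len' bytes reports back with 'error'. */
void
demo_nested_done(struct demo_master *mbp, size_t len, int error)
{
	assert(mbp != NULL && len <= mbp->b_resid);

	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= len;
	if (mbp->b_resid == 0) {
		/* On error, report the whole transfer as outstanding. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}

/*
 * Drive a master split into 'nchunks' chunks of 'chunk' bytes. Chunk number
 * 'failing' reports 'error' (no chunk fails if 'failing' is out of range).
 * Returns the final b_resid of the master.
 */
size_t
demo_run_master(size_t chunk, size_t nchunks, size_t failing, int error)
{
	struct demo_master m = { chunk * nchunks, chunk * nchunks, 0, 0 };
	size_t i;

	for (i = 0; i < nchunks; i++)
		demo_nested_done(&m, chunk, i == failing ? error : 0);
	return m.b_resid;
}
```

In this model a clean run ends with a residual of zero, while a run with one failing chunk ends with the residual equal to the full byte count, so an "error implies residual" assertion holds.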


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
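
The growth scenario above comes down to simple arithmetic, sketched here under stated assumptions. Both functions are hypothetical helpers written for illustration; they are not the canrelease code.

```c
#include <stddef.h>

/*
 * Hypothetical model: 'nbufs' buffers of 'small' bytes fill the cache, then
 * 'recycled' of them are reused for a file system with 'large' blocks.
 * Returns the resulting cache size in bytes.
 */
size_t
demo_bytes_after_recycling(size_t nbufs, size_t small, size_t large,
    size_t recycled)
{
	if (recycled > nbufs)
		recycled = nbufs;
	return (nbufs - recycled) * small + recycled * large;
}

/* Bytes that must be released to get back down to the high-water mark. */
size_t
demo_excess_over_hiwater(size_t bufmem, size_t hiwater)
{
	return bufmem > hiwater ? bufmem - hiwater : 0;
}
```

For example, 100 buffers of 1 KiB sit exactly at a 102400-byte high-water mark. Recycling just 10 of them as 16 KiB buffers yields 256000 bytes, so 153600 bytes would have to be released to return to the mark.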


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.
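
The log does not quote the expression that overflowed, so this is only a generic sketch of how a 32-bit product can wrap in a threshold calculation and how dividing first avoids it. The function names are hypothetical, and 'guess' is assumed to be a small value below 16.

```c
#include <stdint.h>

/*
 * Multiply first, emulating 32-bit arithmetic: the product is truncated
 * to 32 bits before the division, so large water marks wrap around.
 */
uint32_t
demo_threshold_mul_first(uint32_t hiwater, uint32_t guess)
{
	return (uint32_t)((uint64_t)hiwater * guess) / 16;
}

/* Divide first: no intermediate value exceeds 32 bits when guess < 16. */
uint32_t
demo_threshold_div_first(uint32_t hiwater, uint32_t guess)
{
	return hiwater / 16 * guess;
}
```

With a 1 GiB high-water mark and a guess of 8, the multiply-first form wraps to zero while the divide-first form gives 512 MiB. For small values the two agree.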

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.
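
A sketch of the adjusted calculation, assuming the page deficit is first converted to bytes so that both operands of the minimum share a unit. The function and its parameters are hypothetical; the units used in the actual revision are not shown in this log.

```c
#include <stddef.h>

/*
 * Hypothetical: convert a page deficit to bytes, then cap the amount to
 * release at one sixteenth of the current buffer memory.
 */
size_t
demo_release_target(size_t page_deficit, size_t page_size, size_t bufmem)
{
	size_t demand = page_deficit * page_size;	/* pages -> bytes */
	size_t cap = bufmem / 16;

	return demand < cap ? demand : cap;
}
```

A small deficit is honoured in full, while a large one is limited by the cap, which avoids dropping straight to the low water mark on a single request.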

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
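
One way to express such growth control is sketched below. This is an assumption-laden illustration, not the code from this revision: demo_may_grow is a hypothetical name, and the random draw is passed in as 'guess' so that the function is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the cache may grow. 'guess' is a random draw in [0, 16)
 * supplied by the caller.
 */
bool
demo_may_grow(uint64_t bufmem, uint64_t lowater, uint64_t hiwater,
    unsigned guess)
{
	assert(guess < 16 && lowater <= hiwater);

	if (bufmem < lowater)
		return true;	/* always grow below the low water mark */
	if (bufmem > hiwater)
		return false;	/* never grow above the high water mark */
	/* The larger the cache, the fewer values of 'guess' pass this test. */
	return hiwater / 16 * guess >= bufmem;
}
```

With a high-water mark of 1600 and a cache of 800 bytes, draws of 8 through 15 allow growth and draws of 0 through 7 do not, so the chance of a new allocation falls as the cache fills.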

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.
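
As a rough illustration of the finer-than-page granularity described above,
the sketch below maps a requested buffer size onto a power-of-two size class,
one class per pool. The 1k smallest class and the names MIN_SHIFT and
size_class() are assumptions made for this example; they are not taken from
the committed code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed smallest size class: 1 << 10 = 1k bytes. */
#define MIN_SHIFT 10

/*
 * Return the index of the smallest power-of-two size class that can
 * hold `size` bytes: 0 for up to 1k, 1 for up to 2k, 2 for up to 4k, ...
 */
unsigned int
size_class(size_t size)
{
	unsigned int idx = 0;

	assert(size > 0);
	size = (size - 1) >> MIN_SHIFT;
	while (size != 0) {
		size >>= 1;
		idx++;
	}
	return idx;
}
```

Under these assumptions a 1536-byte request lands in class 1 (2k), so two
such buffers can share one 4k page instead of each taking a page of its own.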


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
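
A minimal userland sketch of the idea, assuming the usual <sys/queue.h>
layout (tqh_last, tqe_next, tqe_prev). The struct item, queues[] and
item_remove() names are hypothetical and this is not the kernel's bremfree().
Only the tail element needs its queue head, and that head can be recognised
because its tqh_last points at the element's own tqe_next field.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct item {
	int id;
	TAILQ_ENTRY(item) link;
};
TAILQ_HEAD(itemq, item);

#define NQUEUES 3
struct itemq queues[NQUEUES];

void
queues_init(void)
{
	for (int i = 0; i < NQUEUES; i++)
		TAILQ_INIT(&queues[i]);
}

/* Unlink `it` from whichever of queues[] it is on, without being told which. */
void
item_remove(struct item *it)
{
	struct item *next = TAILQ_NEXT(it, link);

	if (next != NULL) {
		/* Not the tail: no queue head is touched at all. */
		next->link.tqe_prev = it->link.tqe_prev;
	} else {
		/* Tail: find the head whose tqh_last refers to this element. */
		struct itemq *q = NULL;

		for (int i = 0; i < NQUEUES; i++) {
			if (queues[i].tqh_last == &it->link.tqe_next) {
				q = &queues[i];
				break;
			}
		}
		assert(q != NULL);	/* element was not on any known queue */
		q->tqh_last = it->link.tqe_prev;
	}
	*it->link.tqe_prev = next;
}
```

The search over the heads only runs for the last element of a queue; every
other removal is constant time.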


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel
and the LKM framework will refer to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.
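
The decision described above can be sketched as a pair of predicates. The
XB_* flag names and values are placeholders invented for this example (they
are not the kernel's B_* definitions), and the functions only model the
logic, not the real biowait().

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag bits for the sketch only. */
#define XB_DONE   0x01u	/* i/o has completed */
#define XB_DELWRI 0x02u	/* delayed write: contents are valid in the cache */
#define XB_ERROR  0x04u	/* i/o failed, e.g. read past end of device */

/* The waiter sleeps only if neither "done" nor "delayed write" is set. */
bool
must_sleep_for_io(unsigned int flags)
{
	return (flags & (XB_DONE | XB_DELWRI)) == 0;
}

/*
 * Result reported once no sleep is needed.  Because the waiter always
 * runs, a buffer marked done *and* failed still reports its error.
 */
int
wait_result(unsigned int flags)
{
	assert(!must_sleep_for_io(flags));
	return (flags & XB_ERROR) != 0 ? -1 : 0;
}
```

In this model a buffer that was completed with an error set returns -1
rather than being mistaken for a cache hit.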


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
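
The point to watch in this mapping is the argument order: bcopy() and
ovbcopy() take the source first, while memcpy() and memmove() take the
destination first. A small sketch, using hypothetical wrapper names that
keep the old source-first order so the swap is visible:

```c
#include <assert.h>
#include <string.h>

/* Was bcopy(src, dst, n): regions must not overlap. */
void
copy_bytes(const void *src, void *dst, size_t n)
{
	assert(src != NULL && dst != NULL);
	memcpy(dst, src, n);		/* destination comes first */
}

/* Was ovbcopy(src, dst, n): regions may overlap. */
void
move_bytes(const void *src, void *dst, size_t n)
{
	assert(src != NULL && dst != NULL);
	memmove(dst, src, n);		/* destination comes first */
}

/* Was bcmp(a, b, n): nonzero if the regions differ. */
int
bytes_differ(const void *a, const void *b, size_t n)
{
	return memcmp(a, b, n) != 0;
}

/* Was bzero(p, n). */
void
zero_bytes(void *p, size_t n)
{
	memset(p, 0, n);		/* fill value goes in the middle */
}
```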


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
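
The pattern amounts to picking a fallback target before touching the
counters. A simplified sketch, with a hypothetical io_stats structure
standing in for the per-process resource usage and proc0_stats standing in
for process 0:

```c
#include <assert.h>
#include <stddef.h>

struct io_stats {
	long inblock;	/* block input operations */
	long oublock;	/* block output operations */
};

/* Stand-in for the statistics of process 0. */
struct io_stats proc0_stats;

/* Use the current process if there is one, else fall back to process 0. */
struct io_stats *
stats_target(struct io_stats *cur)
{
	return cur != NULL ? cur : &proc0_stats;
}

void
count_read(struct io_stats *cur)
{
	struct io_stats *s = stats_target(cur);

	assert(s != NULL);
	s->inblock++;
}
```

Calling count_read(NULL) charges the read to proc0_stats instead of
dereferencing a null pointer.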


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.285 27-Dec-2019 msaitoh

s/transfered/transferred/


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
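
To make the arithmetic concrete: the multiplicand in the UBSan message,
3345138, times 1024 is 3425421312, which is larger than a 32-bit INT_MAX.
The sketch below checks for that condition and shows one way to avoid it.
The function names are hypothetical, and widening to size_t before the
multiply is a choice made for this example rather than a description of the
committed fix.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* True if (nbuf / 2) * 1024 cannot be represented in a signed int. */
bool
overflows_int(int nbuf)
{
	assert(nbuf >= 0);
	return nbuf / 2 > INT_MAX / 1024;
}

/* Same computation on an unsigned count, widened before the multiply. */
size_t
bufcount_max(unsigned int nbuf)
{
	return (size_t)(nbuf / 2) * 1024;
}
```

With a buffer count of 6690276 (twice the value in the message),
overflows_int() reports the problem, while bufcount_max() yields a
well-defined result.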


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
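
The format-string rules listed above can be tried out in userland with
snprintf(3). This is only a sketch: format_record() is a hypothetical
helper, not part of kernhist(9). It shows the two conventions from the
list: the 'j' length modifier on every numeric conversion, and pointers
cast through uintptr_t to uintmax_t and printed with "%#jx" instead of "%p".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format one record with every argument passed as uintmax_t. */
int
format_record(char *out, size_t len, const void *bp, unsigned long count)
{
	assert(out != NULL && len > 0);
	return snprintf(out, len, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)bp,	/* pointer -> integer of same size -> widest */
	    (uintmax_t)count);
}
```

Because every argument has the same known width, the consumer of the record
does not need to parse the format string to know how large each one is.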


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock = &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
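
One common way to trade a division for a multiplication when scaling a
32-bit random value is sketched below. This is an assumption about the
general technique, not a copy of what the commit does: scale_to_range() is
a hypothetical name, and its input stands in for a value such as the one
returned by cprng_fast32().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a uniformly distributed 32-bit value r onto [0, n) using one
 * 64-bit multiply and a shift instead of r % n.
 */
uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	assert(n != 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The result is floor(r * n / 2^32), so it is always less than n. Like the
modulo form, it is very slightly biased when n does not divide 2^32.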


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that if an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug, but with a machine having an ffs+wapbl mount, NFS
mounts and an ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite a normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon told that this helps for interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
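
Editorial sketch of the unit handling described in this entry: the UVM shortfall arrives in pages, buffer memory is counted in bytes, and the AGE list sets a floor on what is released. The function and parameter names are hypothetical, and the way the two adjustments are combined here is an assumption, not the committed buf_canrelease().

```c
#include <stddef.h>

static size_t
min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

static size_t
max_sz(size_t a, size_t b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical release target, in bytes.
 *   page_deficit  pages the VM system is short of its free target
 *   page_size     bytes per page
 *   bufmem_bytes  bytes currently held by the buffer cache
 *   age_bytes     bytes held by buffers on the AGE list
 */
size_t
canrelease_sketch(size_t page_deficit, size_t page_size,
    size_t bufmem_bytes, size_t age_bytes)
{
	/* Convert pages to bytes before comparing with byte counts. */
	size_t wanted = page_deficit * page_size;

	/* Give back no more than 1/16 of the cache for the deficit... */
	size_t capped = min_sz(wanted, bufmem_bytes / 16);

	/* ...but always at least what is sitting on the AGE list. */
	return max_sz(capped, age_bytes);
}
```

With 4096-byte pages and 1 MB of buffer memory, a deficit of 10 pages asks for 40960 bytes, while a deficit of 100 pages is capped at 65536 bytes.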


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
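
Editorial sketch of the idea in the comment above, using a minimal tail queue written for this example rather than the <sys/queue.h> macros. All names (struct node, struct queue, queues, remove_any) are hypothetical. The head is needed only when the element being removed is the tail, and in that case it can be found by checking which head's tail pointer refers to the element.

```c
#include <stddef.h>

struct node {
	struct node *next;	/* next element, or NULL at the tail */
	struct node **prevp;	/* address of the pointer that points at us */
};

struct queue {
	struct node *first;	/* first element, or NULL if empty */
	struct node **lastp;	/* address of the last element's next pointer */
};

#define NQUEUES 3
struct queue queues[NQUEUES];

void
queues_init(void)
{
	for (int i = 0; i < NQUEUES; i++) {
		queues[i].first = NULL;
		queues[i].lastp = &queues[i].first;
	}
}

void
queue_insert_tail(struct queue *q, struct node *n)
{
	n->next = NULL;
	n->prevp = q->lastp;
	*q->lastp = n;
	q->lastp = &n->next;
}

/* Remove n without being told which queue it is on. */
void
remove_any(struct node *n)
{
	if (n->next != NULL) {
		/* Not the tail: no head needs updating. */
		n->next->prevp = n->prevp;
	} else {
		/* Tail: find the head whose tail pointer refers to n. */
		for (int i = 0; i < NQUEUES; i++) {
			if (queues[i].lastp == &n->next) {
				queues[i].lastp = n->prevp;
				break;
			}
		}
	}
	*n->prevp = n->next;
}
```

The search over heads runs only for tail elements, so removal from the middle of a list stays constant time.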


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed up in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
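
Editorial sketch of the mapping above. The point to watch is that bcopy() and ovbcopy() take the source first, while memcpy() and memmove() take the destination first. The wrapper names below are hypothetical and exist only to make the argument order explicit.

```c
#include <string.h>

/* Was bcopy(src, dst, len): source and destination swap places. */
void
copy_block(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);		/* regions must not overlap */
}

/* Was ovbcopy(src, dst, len): memmove() is the overlap-safe form. */
void
shift_block(void *dst, const void *src, size_t len)
{
	memmove(dst, src, len);
}

/* Was bcmp(a, b, len): argument order is unchanged. */
int
blocks_differ(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len) != 0;
}

/* Was bzero(p, len): memset() takes the fill byte as its second argument. */
void
zero_block(void *p, size_t len)
{
	memset(p, 0, len);
}
```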


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite out, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.284 21-Dec-2019 ad

uvmexp.free -> uvm_free()


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
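
Editorial sketch of one way to keep the multiplication quoted above out of signed arithmetic. The commit itself changed the return type of buf_nbuf() to u_int; widening to size_t before multiplying, as below, is a variant shown for illustration only, and the function name is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: compute (nbuf / 2) * 1024 in size_t so the product
 * is never evaluated as a signed int.
 */
size_t
bufcount_max_sketch(unsigned int nbuf)
{
	return ((size_t)nbuf / 2) * 1024;
}
```

For the value in the UBSan report, 3345138 * 1024 is 3425421312, which exceeds INT_MAX (2147483647) but is representable in an unsigned 32-bit type.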


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
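
Editorial sketch of the format-string conventions listed above, using plain snprintf() rather than the kernhist(9) macros. The function name and the message text are hypothetical; only the "%#jx" and "%ju" specifiers and the cast chain come from the commit message.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a pointer and a counter the way the entry describes: every
 * argument is passed as uintmax_t, pointers go through uintptr_t first,
 * and the specifiers carry the 'j' length modifier.
 */
int
format_hist_sketch(char *out, size_t outlen, const void *ptr,
    unsigned int count)
{
	return snprintf(out, outlen, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)ptr, (uintmax_t)count);
}
```

Casting through uintptr_t first avoids the "pointer to integer of a different size" warning on platforms where uintmax_t is wider than a pointer.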


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
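
The commit does not show the computation itself. As a generic sketch of the technique it names, a uniformly distributed 32-bit value can be mapped onto a range with one widening multiplication and a shift instead of a division; scale32() is a hypothetical name, not a function from vfs_bio.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 32-bit random value r onto [0, n) without dividing:
 * floor(r * n / 2^32), with the product computed in 64 bits.
 */
uint32_t
scale32(uint32_t r, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

In the kernel, r would presumably come from cprng_fast32(); here it is left as a parameter so the arithmetic can be checked on its own.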


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
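
A small sketch of the fix's idea, under the assumption that the handler validates a proposed value in a temporary before storing it: the temporary must have the same type as the variable, otherwise values above INT_MAX are mangled on 64-bit systems. set_watermark() and its range arguments are hypothetical, not the sysctl handler from vfs_bio.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: validate a proposed watermark in a temporary of
 * the same type as the variable, and store it only if it is in range.
 */
int
set_watermark(unsigned long *var, unsigned long proposed,
    unsigned long lo, unsigned long hi)
{
	unsigned long t = proposed;	/* same width as *var: no truncation */

	assert(var != NULL);
	if (t < lo || t > hi)
		return EINVAL;		/* leave *var untouched */
	*var = t;
	return 0;
}
```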
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics, since it asserts that if an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
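
The accounting described in this message can be sketched roughly as follows. This is a simplified model, not the code from vfs_bio.c: struct master_buf and nest_child_done() are hypothetical, the done flag stands in for the call to biodone(mbp), and keeping the first reported error is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the master buffer (mbp). */
struct master_buf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back by children */
	int    b_error;		/* error reported by a child, or 0 */
	int    done;		/* set where biodone(mbp) would be called */
};

/* Called once per nested buffer when it completes. */
void
nest_child_done(struct master_buf *mbp, size_t donebytes, int error)
{
	assert(mbp != NULL && donebytes <= mbp->b_resid);

	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;	/* assumption: first error wins */
	mbp->b_resid -= donebytes;

	if (mbp->b_resid == 0) {
		/* Last child has reported back. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;	/* error: nothing counted as done */
		mbp->done = 1;
	}
}
```

With the reset in place, a consumer that asserts "error implies non-zero residual" sees a consistent master buffer.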


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin
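
The message does not include the code. One plausible reading, stated here as an assumption, is that a timeout was computed as hz / 25 ticks, which truncates to 0 when hz < 25, and that a timeout of 0 means "no timeout". A sketch of the clamp, with the hypothetical name pause_ticks():

```c
#include <assert.h>

/*
 * Hypothetical helper: compute a 1/25 second timeout in ticks, but never
 * return 0, since 0 is assumed to mean "sleep without a timeout".
 */
int
pause_ticks(int hz)
{
	int t;

	assert(hz > 0);
	t = hz / 25;		/* integer division: 0 whenever hz < 25 */
	return t > 0 ? t : 1;
}
```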


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and an ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to decide whether to allocate doesn't
make sense, since 1) that condition is quite normal and 2) there is a
pool between us and uvm.
- Make the allocation probability curve smoother by moving its origin
from 0 to the lowater mark.
Simon reports in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
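
A hedged sketch of the two points above (the early high-water check and
the overflow fix), not the real buf_lotsfree(). The function name
may_grow() is hypothetical, and the `roll` parameter stands in for a
random draw in the range 0..15 so that the example is deterministic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of a probabilistic "may the cache grow?" test.
 * Returns 1 if a new allocation is allowed, 0 otherwise.
 */
static int
may_grow(uint32_t bufmem, uint32_t lowater, uint32_t hiwater, uint32_t roll)
{
	assert(lowater < hiwater && roll < 16);

	/* Below the low-water mark: always allow. */
	if (bufmem < lowater)
		return 1;

	/* Above the high-water mark: refuse before any random draw. */
	if (bufmem > hiwater)
		return 0;

	/*
	 * Widen to 64 bits before multiplying; in 32-bit arithmetic
	 * (hiwater - lowater) * roll can wrap for large caches.
	 */
	uint64_t threshold = (uint64_t)(hiwater - lowater) * roll / 16;

	return threshold >= (uint64_t)(bufmem - lowater);
}
```

The last case in a quick check, with a range of 0xE0000000 and roll 15,
is one where the 32-bit product would wrap and give a different answer.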


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- Backward compatibility for loading block/character device
switches through the LKM framework is broken. This is necessary to convert
from block/character device major to device name at runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel
and the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
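
The argument order is the part of this mapping that is easy to get wrong:
the legacy calls take the source first, the ISO C calls take the
destination first. A small sketch, using hypothetical compat_* wrapper
names chosen only to avoid clashing with libc:

```c
#include <assert.h>
#include <string.h>

/* bcopy(src, dst, len) -> memcpy(dst, src, len) */
static void
compat_bcopy(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	memcpy(dst, src, len);		/* destination comes first */
}

/* ovbcopy(src, dst, len) -> memmove(dst, src, len) */
static void
compat_ovbcopy(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	memmove(dst, src, len);		/* safe for overlapping regions */
}

/* bzero(buf, len) -> memset(buf, 0, len) */
static void
compat_bzero(void *buf, size_t len)
{
	assert(buf != NULL);
	memset(buf, 0, len);		/* fill value is the middle argument */
}
```

bcmp() and memcmp() take their arguments in the same order, so that
conversion is a plain rename.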


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
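
A simplified sketch of the pattern described above. The struct layout is
an assumption made for illustration (the kernel keeps these counters
deeper inside the process statistics), and charge_inblock() is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's process structure. */
struct proc {
	long ru_inblock;	/* block input operations charged */
};

static struct proc proc0;		/* process 0 always exists */
static struct proc *curproc = NULL;	/* may be NULL, e.g. during a panic */

/* Charge one block read to the current process, or to proc0 if none. */
static void
charge_inblock(void)
{
	struct proc *p = (curproc != NULL) ? curproc : &proc0;

	p->ru_inblock++;
}
```

The local pointer is resolved once, so every later use is safe even when
curproc is NULL.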


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.283 11-Dec-2019 ad

Add a comment.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
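
The arithmetic behind the report can be checked in isolation. The commit
itself changes the return type to u_int; the sketch below shows a
different, hypothetical option, widening to 64 bits before the multiply.
The helper name bufcount_to_bytes() is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: scale a buffer count to bytes.
 * 3345138 * 1024 = 3425421312, which exceeds INT32_MAX (2147483647),
 * so the product cannot be represented in a 32-bit signed int.
 * Converting to a 64-bit unsigned type first avoids the overflow.
 */
static uint64_t
bufcount_to_bytes(unsigned int count)
{
	return (uint64_t)count * 1024;
}
```

The same value does fit in a 32-bit unsigned int, which is why returning
u_int silences this particular report.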


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
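
The format-string rules listed in this entry can be sketched in userland C as follows. The helper names fmt_ptr_sketch and fmt_num_sketch are hypothetical, and snprintf(3) stands in for the kernhist(9) macros, which are not reproduced here. The only parts taken from the entry are the 'j' length modifier, the "%#jx" replacement for "%p", and the (uintptr_t) then (uintmax_t) cast chain.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical userland helper: format a pointer the way the entry
 * describes, casting via (uintptr_t) and then (uintmax_t) and
 * printing with "%#jx" instead of "%p".
 */
static int
fmt_ptr_sketch(char *out, size_t outlen, const void *p)
{
	return snprintf(out, outlen, "%#jx", (uintmax_t)(uintptr_t)p);
}

/* Format a plain numeric argument using the 'j' length modifier. */
static int
fmt_num_sketch(char *out, size_t outlen, unsigned long v)
{
	return snprintf(out, outlen, "%ju", (uintmax_t)v);
}
```

Casting every argument to uintmax_t means each one has the same size regardless of its original type, which is the property the entry relies on when the format string is not a literal.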


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
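
The entry does not show the computation itself. One common way to map a 32-bit random value into the range [0, n) with a multiplication instead of a division is the multiply-and-shift sketch below. The function name is hypothetical, rnd stands in for a value such as one returned by cprng_fast32(), and the use of uint64_t is an assumption. This is not claimed to be the code that was committed.

```c
#include <stdint.h>

/*
 * Hypothetical helper: scale a 32-bit random value into [0, n)
 * using a widening multiply and a shift, with no division.
 */
static uint32_t
scale_u32_sketch(uint32_t rnd, uint32_t n)
{
	/* rnd / 2^32 is in [0, 1), so (rnd * n) >> 32 is in [0, n). */
	return (uint32_t)(((uint64_t)rnd * n) >> 32);
}
```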


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)
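
The calling convention described in this entry (no buffer is handed back on error, so callers must not release one) can be sketched with toy types. Everything below is hypothetical: toy_rbuf, toy_bread() and toy_brelse() are stand-ins, not the kernel's struct buf, bread() or brelse(), and the fail flag merely simulates an I/O error.

```c
#include <errno.h>
#include <stdlib.h>

/* Toy stand-in for a cache buffer. */
struct toy_rbuf {
	int b_busy;
};

static int toy_released;	/* counts calls to toy_brelse() */

static void
toy_brelse(struct toy_rbuf *bp)
{
	toy_released++;
	free(bp);
}

/*
 * Hypothetical read routine: on error it releases the buffer itself
 * and sets *bpp to NULL, so the caller has nothing to release.
 */
static int
toy_bread(int fail, struct toy_rbuf **bpp)
{
	struct toy_rbuf *bp = malloc(sizeof(*bp));

	*bpp = NULL;
	if (bp == NULL)
		return ENOMEM;
	bp->b_busy = 1;
	if (fail) {
		toy_brelse(bp);	/* released here, not by the caller */
		return EIO;
	}
	*bpp = bp;
	return 0;
}
```

A caller following this convention checks the return value first and only touches or releases the buffer when the call succeeded.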


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
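
The accounting described in this entry can be sketched with a toy structure. The type toy_buf and the function toy_nest_done() are hypothetical and only mirror the three fields named in the text (b_error, b_bcount, b_resid). The real nestiobuf code is not reproduced here, and the return value merely marks the point where biodone(mbp) would be called.

```c
#include <stddef.h>

/* Toy stand-in for the struct buf fields mentioned in the entry. */
struct toy_buf {
	int	b_error;	/* 0 on success, errno value otherwise */
	size_t	b_bcount;	/* total bytes requested */
	size_t	b_resid;	/* bytes not yet reported back */
};

/*
 * Hypothetical completion step for one nested buffer.
 * Returns 1 when the master buffer is complete, 0 otherwise.
 */
static int
toy_nest_done(struct toy_buf *mbp, int error, size_t donebytes)
{
	if (error != 0)
		mbp->b_error = error;	/* one failing child fails the master */
	mbp->b_resid -= donebytes;
	if (mbp->b_resid != 0)
		return 0;		/* more nested buffers outstanding */
	if (mbp->b_error != 0)
		mbp->b_resid = mbp->b_bcount;	/* error: report nothing done */
	return 1;
}
```

Resetting b_resid to b_bcount on error keeps the invariant that a buffer with an error set also has residual data, which is what the physio() assertion in the entry checks.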


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.
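
To illustrate why the array index stops matching the size in kilobytes once a 512-byte pool exists, the sketch below maps a request size to an index over power-of-two pools starting at 512 bytes. The names toy_pool_index and TOY_MIN_SHIFT are hypothetical, and the power-of-two layout is an assumption made for the example, not a description of the actual pool table in vfs_bio.c.

```c
#include <stddef.h>

#define TOY_MIN_SHIFT	9	/* 1 << 9 == 512 bytes, the smallest pool */

/*
 * Hypothetical mapping from a request size to a pool index, where
 * pool i holds buffers of (512 << i) bytes.
 */
static unsigned int
toy_pool_index(size_t size)
{
	unsigned int idx = 0;
	size_t poolsize = (size_t)1 << TOY_MIN_SHIFT;

	/* Walk up the power-of-two sizes until the request fits. */
	while (poolsize < size) {
		poolsize <<= 1;
		idx++;
	}
	return idx;
}
```

Under this layout index 0 is the 512-byte pool and index 1 is the 1k pool, so naming a pool after its index would mislabel every size.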


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.
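
As a rough illustration of the size-class idea behind this change, the sketch below maps a request size to a power-of-two pool starting at 512 bytes. The names `pool_index`, `pool_item_size`, `pool_name` and the name format are hypothetical, chosen for this example only; they are not taken from vfs_bio.c.

```c
#include <assert.h>
#include <stdio.h>

#define POOL_INDEX_OFFSET 9	/* log2(512): smallest pool size (assumed) */

/* Map a request size in bytes to the index of the smallest pool that fits. */
static unsigned
pool_index(unsigned long size)
{
	unsigned n = 0;

	assert(size > 0);
	size = (size - 1) >> POOL_INDEX_OFFSET;
	while (size != 0) {
		size >>= 1;
		n++;
	}
	return n;
}

/* Item size served by pool i: 512, 1024, 2048, ... */
static unsigned long
pool_item_size(unsigned i)
{
	return 512UL << i;
}

/* Hypothetical naming scheme: bytes below 1k, kilobytes from 1k upward. */
static void
pool_name(unsigned i, char *buf, size_t len)
{
	unsigned long sz = pool_item_size(i);

	if (sz < 1024)
		snprintf(buf, len, "buf%lub", sz);
	else
		snprintf(buf, len, "buf%luk", sz / 1024);
}
```

With a 512-byte class available, a single-frag directory on a 4k/512 filesystem lands in pool 0 instead of occupying a 1k item, which is the doubling the message describes. It also shows why the array index no longer equals the size in kilobytes.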


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
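
The class of 32-bit overflow described above can be reproduced in a few lines. This is only a sketch: the expression below (scaling the current buffer memory by 16 before dividing by the high water mark) is an assumption made for illustration, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow arithmetic: the multiply wraps once bufmem reaches 256 MB. */
static uint32_t
fill_level_narrow(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater > 0);
	return (bufmem * 16u) / hiwater;
}

/* Widen to 64 bits before scaling so the intermediate cannot wrap. */
static uint32_t
fill_level_wide(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater > 0);
	return (uint32_t)(((uint64_t)bufmem * 16u) / hiwater);
}
```

With 512 MB in use against a 256 MB high water mark, the narrow version computes a fill level of 0 (an apparently empty cache), while the widened version reports 32, well past the limit.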


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
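
The release calculation described in this message can be sketched as follows. The function name, the parameter names, and the choice to work in bytes throughout are assumptions made for this example; the real buf_canrelease() may count in different units.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How much buffer memory to give back, in bytes (illustrative only).
 *   deficit_pages: pages UVM still wants freed
 *   page_size:     bytes per page
 *   bufmem:        bytes currently held by the buffer cache
 *   age_bytes:     bytes held by buffers on the AGE list
 */
static uint64_t
release_target(uint64_t deficit_pages, uint64_t page_size,
    uint64_t bufmem, uint64_t age_bytes)
{
	uint64_t want, cap, n;

	assert(page_size > 0);
	want = deficit_pages * page_size;	/* convert pages to bytes first */
	cap = bufmem / 16;			/* never drop more than 1/16 at once */
	n = (want < cap) ? want : cap;		/* min(page deficit, bufmem / 16) */
	return (n > age_bytes) ? n : age_bytes;	/* always at least the AGE list */
}
```

Converting the deficit to bytes before comparing is the point of the first paragraph of the message: mixing pages with bytes or buffers makes the result far too small or far too large.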


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
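
A minimal sketch of the growth control described above, assuming a linear fall-off in 16 steps and a caller-supplied random value. The function name and the step count are hypothetical, not taken from the source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decide whether the cache may take a fresh allocation instead of
 * recycling an existing buffer.  `roll` stands in for a random number;
 * only its low 4 bits are used.
 */
static int
alloc_permitted(uint64_t bufmem, uint64_t hiwater, unsigned roll)
{
	uint64_t level;

	assert(hiwater > 0);
	if (bufmem >= hiwater)
		return 0;			/* at the limit: must recycle */
	level = (bufmem * 16) / hiwater;	/* fill level, 0..15 */
	return (roll & 15) >= level;		/* fuller cache, fewer passes */
}
```

At half the high water mark the fill level is 8, so 8 of the 16 possible rolls pass and about half of all requests grow the cache; close to the limit almost none do.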


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
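
The trick mentioned in that comment can be sketched with a hand-rolled tail queue. The types and helpers below are hypothetical stand-ins written for this example, not the <sys/queue.h> macros or the actual bremfree() code. The point is that a queue head only has to be located when the element being removed is the last one on its list.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;	/* next element, or NULL at the tail */
	struct node **prevp;	/* address of the pointer that points at us */
};

struct queue {
	struct node *first;
	struct node **lastp;	/* address of the last element's next field */
};

#define NQUEUES 2
static struct queue queues[NQUEUES];

static void
queues_init(void)
{
	for (int i = 0; i < NQUEUES; i++) {
		queues[i].first = NULL;
		queues[i].lastp = &queues[i].first;
	}
}

static void
queue_append(struct queue *q, struct node *n)
{
	n->next = NULL;
	n->prevp = q->lastp;
	*q->lastp = n;
	q->lastp = &n->next;
}

/* Remove n from whichever queue holds it, without being told which. */
static void
node_remove(struct node *n)
{
	if (n->next != NULL) {
		n->next->prevp = n->prevp;	/* interior: no head needed */
	} else {
		struct queue *q = NULL;

		/* Tail: find the head whose tail pointer refers to us. */
		for (int i = 0; i < NQUEUES; i++)
			if (queues[i].lastp == &n->next)
				q = &queues[i];
		assert(q != NULL);
		q->lastp = n->prevp;
	}
	*n->prevp = n->next;
}
```

Removal of any non-tail element is constant time and never touches a head; only tail removal scans the small, fixed set of queues.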


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
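
The mapping above swaps the first two arguments for the copy routines, which is the easy part to get wrong. A short sketch of the converted forms follows; the wrapper names are hypothetical and exist only to make the argument order visible.

```c
#include <assert.h>
#include <string.h>

/* Was: bcopy(src, dst, len).  Source and destination swap places. */
static void
copy_block(const char *src, char *dst, size_t len)
{
	memcpy(dst, src, len);
}

/* Was: ovbcopy(buf + from, buf + to, len).  Regions may overlap. */
static void
shift_block(char *buf, size_t from, size_t to, size_t len)
{
	memmove(buf + to, buf + from, len);
}

/* Was: bcmp(a, b, len).  Only zero versus nonzero is meaningful. */
static int
blocks_differ(const char *a, const char *b, size_t len)
{
	return memcmp(a, b, len) != 0;
}

/* Was: bzero(buf, len).  The fill value becomes an explicit argument. */
static void
clear_block(char *buf, size_t len)
{
	memset(buf, 0, len);
}
```

memcpy() must not be given overlapping regions, which is why the old ovbcopy() calls map to memmove() rather than to memcpy().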


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
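
The pattern described here can be sketched as below. `struct proc`, `curproc` and `proc0` in this example are simplified stand-ins declared locally for illustration, and `account_block_read` is a hypothetical helper; none of this is the kernel's actual definition.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the per-process resource usage counters. */
struct proc {
	long ru_inblock;	/* block input operations */
	long ru_oublock;	/* block output operations */
};

static struct proc proc0;	/* stand-in for process 0 */
static struct proc *curproc;	/* may be NULL, e.g. during a panic */

/* Charge one block read to the current process, or to proc0 if none. */
static void
account_block_read(void)
{
	struct proc *p = (curproc != NULL) ? curproc : &proc0;

	p->ru_inblock++;
}
```

Taking the pointer into a local once, with a fallback, keeps every later use safe even when no current process exists.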


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.282 08-Dec-2019 ad

For safety, cv_broadcast(&bp->b_busy) in more places where the buffer is
changing identity or moving from one vnode list to another.


# 1.281 08-Dec-2019 ad

Adjustment to previous: if we're going to toss the buffer, then wake
everybody.


# 1.280 08-Dec-2019 ad

- Avoid thundering herd: cv_broadcast(&bp->b_busy) -> cv_signal(&bp->b_busy)
- Sprinkle __cacheline_aligned.


Revision tags: phil-wifi-20191119
# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
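
The overflow described above can be sketched in isolation. This is a minimal illustration, not the code from this revision: demo_buf_nbuf(), overflows_int() and demo_bufcount_max() are hypothetical names, and the buffer count is chosen only so that half of it matches the operand in the UBSan report. Widening to size_t before the multiply is one possible approach, assumed here for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for buf_nbuf(); the value is twice the operand
 * shown in the UBSan report, so that nbuf / 2 == 3345138.
 */
static unsigned int
demo_buf_nbuf(void)
{
	return 6690276u;
}

/* Nonzero if half * 1024 cannot be represented in a signed int. */
static int
overflows_int(unsigned int half)
{
	return half > (unsigned int)INT_MAX / 1024u;
}

/* Widen before multiplying so the product is computed in size_t. */
static size_t
demo_bufcount_max(void)
{
	size_t half = demo_buf_nbuf() / 2;

	assert(half <= (size_t)-1 / 1024);	/* no wraparound in size_t */
	return half * 1024;
}
```

With a 32-bit int, overflows_int() reports that the product from the report does not fit, while the widened computation stays well defined.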


Revision tags: netbsd-9-0-RC1 netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
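
The format string rules listed above can be illustrated with plain snprintf(3). This is a userland sketch under stated assumptions, not kernhist(9) code: demo_format_hist() is a hypothetical helper, and it only shows the 'j' length modifier together with the (uintptr_t) then (uintmax_t) cast chain described in the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a pointer and a counter the way the
 * updated format strings expect, with every argument as uintmax_t.
 */
static int
demo_format_hist(char *out, size_t outlen, const void *bp, unsigned int count)
{
	assert(out != NULL && outlen > 0);

	/* "%p" becomes "%#jx"; the pointer goes through uintptr_t first. */
	return snprintf(out, outlen, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)bp, (uintmax_t)count);
}
```

Because every argument has the same width, the formatter never has to guess how many words a value occupies, which is the problem the first bullet describes.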


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already sets bp->b_objlock = &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
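
A common way to trade a division for a multiplication when working with a 32-bit random value is a widening multiply followed by a shift. The sketch below shows that general technique only; it is an assumption for illustration and is not taken from this revision. demo_scale32() is a hypothetical name, and the random value is passed in rather than drawn from cprng_fast32().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 32-bit value r onto the range [0, n) without dividing:
 * the 64-bit product r * n is at most n * 2^32 - n, so its top
 * 32 bits are always less than n.
 */
static uint32_t
demo_scale32(uint32_t r, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```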


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read-error does occur in one of the nested buffers, the mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
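
The completion rule described above can be modelled in a few lines. This is a simplified userland sketch, not the kernel's nestiobuf code: struct demo_buf, demo_master_init() and demo_nest_done() are hypothetical, only the field names b_bcount, b_resid and b_error come from the commit message, and the done flag stands in for the call to biodone(mbp).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, stripped-down model of the master buffer. */
struct demo_buf {
	size_t	b_bcount;	/* total bytes requested */
	size_t	b_resid;	/* bytes not yet reported back */
	int	b_error;	/* first error seen, 0 if none */
	int	done;		/* stands in for biodone(mbp) */
};

static void
demo_master_init(struct demo_buf *mbp, size_t bcount)
{
	mbp->b_bcount = bcount;
	mbp->b_resid = bcount;
	mbp->b_error = 0;
	mbp->done = 0;
}

/* Called once per nested buffer when it completes. */
static void
demo_nest_done(struct demo_buf *mbp, size_t donebytes, int error)
{
	assert(donebytes <= mbp->b_resid);

	if (error != 0)
		mbp->b_error = error;	/* a later success must not clear it */
	mbp->b_resid -= donebytes;

	if (mbp->b_resid == 0) {
		/* All nested buffers reported back. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;	/* all or nothing */
		mbp->done = 1;
	}
}
```

In this model a master buffer that finishes with an error always carries a nonzero residual, which is the invariant the commit message says physio() asserts.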


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.
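
A minimal sketch of the convention described above, using a stand-in structure rather than the kernel's struct buf. The names sketch_buf, sketch_fail_io and sketch_io_failed are hypothetical and exist only for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the kernel's struct buf; only the fields the sketch needs. */
struct sketch_buf {
	int b_flags;	/* covered by buffer cache locking; drivers leave it alone */
	int b_error;	/* owned by whoever completes the I/O request */
};

/*
 * Hypothetical driver completion path: report failure by storing an
 * errno value in b_error, without reading or writing b_flags.
 */
static void
sketch_fail_io(struct sketch_buf *bp, int error)
{
	assert(bp != NULL && error != 0);
	bp->b_error = error;	/* old style also did: b_flags |= B_ERROR */
}

/* The consumer tests b_error directly rather than a flag bit. */
static int
sketch_io_failed(const struct sketch_buf *bp)
{
	return bp->b_error != 0;
}
```

The point of the split is that a driver never needs to know which lock covers b_flags, because it only touches a field it owns.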


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
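
A rough sketch of the adjusted rule, not the committed code. The function name and the `wanted` parameter are assumptions, and the behaviour between the two water marks is a simplified placeholder; only the "above the high-water mark, release the whole excess" branch comes from the description above:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch only: how many bytes the cache may give back.
 * bufmem  - bytes currently held by the cache
 * lowater - never shrink below this
 * hiwater - never stay above this
 * wanted  - bytes the caller would like released (hypothetical input)
 */
static uint64_t
sketch_canrelease(uint64_t bufmem, uint64_t lowater, uint64_t hiwater,
    uint64_t wanted)
{
	uint64_t room;

	assert(lowater < hiwater);
	if (bufmem < lowater)
		return 0;			/* already at the floor */
	if (bufmem > hiwater)
		return bufmem - hiwater;	/* shed the whole excess at once */

	/* Placeholder policy between the marks: honour the request, */
	/* but never dip below the low-water mark. */
	room = bufmem - lowater;
	return wanted < room ? wanted : room;
}
```

With the middle branch missing, a cache that has grown past hiwater through recycled-and-enlarged buffers would only ever be asked for `wanted` bytes and could stay above the limit indefinitely.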


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to see if we should alloc or not doesn't
make sense, since 1) the condition is quite normal and 2) there is a
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported that this helps interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.
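
The unit mix-up and the new clamp can be sketched as follows. This is an illustration under assumptions, not the committed code: the function name and parameters are hypothetical, and the deficit is converted to bytes before the comparison so both sides of the min() use the same unit:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch only: bytes of buffer memory to release for a given page deficit.
 * page_deficit - pages UVM is short of its free target
 * page_size    - bytes per page
 * bufmem       - bytes currently held by the buffer cache
 */
static uint64_t
sketch_release_target(uint64_t page_deficit, uint64_t page_size,
    uint64_t bufmem)
{
	uint64_t want, cap;

	assert(page_size > 0);
	want = page_deficit * page_size;	/* pages -> bytes */
	cap = bufmem / 16;			/* at most 1/16 of the cache */
	return want < cap ? want : cap;
}
```

Comparing `page_deficit` directly against a byte count, as the earlier code effectively did, understates the demand by a factor of the page size.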

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
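
One way to picture the growth control is a random draw compared against a bar that rises with cache size. The sketch below is an assumption-laden illustration: it uses a linear 16-step ramp from zero up to the high-water mark, and the function name and the caller-supplied `draw` value are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch only: decide whether the cache may grow by one allocation.
 * bufmem  - bytes currently used by the cache
 * hiwater - upper bound on cache size in bytes (must be >= 16)
 * draw    - caller-supplied random value; only the low 4 bits are used
 * Returns 1 to allow a fresh allocation, 0 to force recycling.
 */
static int
sketch_may_grow(uint64_t bufmem, uint64_t hiwater, unsigned draw)
{
	uint64_t thresh;

	assert(hiwater >= 16);
	if (bufmem >= hiwater)
		return 0;	/* never grow past the limit */

	/* Map current usage onto 16 steps; a fuller cache sets a higher bar. */
	thresh = bufmem / (hiwater / 16);
	return (draw & 0xf) >= thresh;
}
```

An empty cache always grows, a cache at half its limit grows on roughly half the draws, and a cache near the limit grows only rarely.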

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() re-sizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as a constant structure in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
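
The argument swap in the first two mappings is the easy part to get wrong. A small userland sketch of what a converted call site looks like; the helper names copy_label and clear_label are hypothetical and not taken from the tree:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy len bytes from src to dst.
 * Old form: bcopy(src, dst, len)   (source first)
 * New form: memcpy(dst, src, len)  (destination first)
 */
static void
copy_label(char *dst, const char *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);		/* was: bcopy(src, dst, len) */
}

/* Old form: bzero(p, len); new form: memset(p, 0, len). */
static void
clear_label(char *p, size_t len)
{
	assert(p != NULL);
	memset(p, 0, len);		/* was: bzero(p, len) */
}
```

Note that bcopy() tolerated overlapping regions on many systems while memcpy() does not, which is why overlapping copies map to memmove() instead.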


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.
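
A minimal sketch of the kind of guard described above, assuming a tiny
stand-in device switch. All EX_* names and ex_bdevsw are hypothetical; the
real bdevsw[] layout and the actual check in bdwrite() are not shown in this
log:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature device switch; real kernels differ in every detail. */
#define EX_NODEV	(-1)
#define EX_D_TAPE	2

struct ex_bdevsw {
	int d_type;
};

static const struct ex_bdevsw ex_bdevsw[] = {
	{ 0 },			/* major 0: a disk-like device */
	{ EX_D_TAPE },		/* major 1: a tape device */
};
#define EX_NBLKDEV	((int)(sizeof(ex_bdevsw) / sizeof(ex_bdevsw[0])))

/* Return 1 only if 'major' is a valid index and names a tape device. */
static int
is_tape_major(int major)
{
	if (major == EX_NODEV)
		return 0;	/* buffer not yet bound to a device */
	if (major < 0 || major >= EX_NBLKDEV)
		return 0;	/* e.g. a pseudo major such as 255 */
	return ex_bdevsw[major].d_type == EX_D_TAPE;
}
```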


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
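
The fallback described above can be sketched as follows. The struct and the
two globals are simplified, hypothetical stand-ins for the kernel's struct
proc, curproc and process 0; the real rusage fields live deeper in the
process structure:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's struct proc, curproc and proc0. */
struct proc {
	long ru_inblock;	/* block input operations charged to this process */
	long ru_oublock;	/* block output operations charged to this process */
};

static struct proc proc0;		/* process 0 always exists */
static struct proc *curproc = NULL;	/* may be NULL, e.g. during a panic */

/* Pick the process to charge: curproc if set, else process 0. */
static struct proc *
accounting_proc(void)
{
	return curproc != NULL ? curproc : &proc0;
}

/* Charge one block write without ever dereferencing a NULL curproc. */
static void
charge_block_write(void)
{
	struct proc *p = accounting_proc();

	p->ru_oublock++;
}
```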


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.279 26-Aug-2019 msaitoh

Change buf_nbuf()'s return value from int to u_int to avoid undefined
behavior in wapbl_start() which extended int to size_t.

Error message was:
> UBSan: Undefined Behavior in ../../../../kern/vfs_wapbl.c:609:41, signed integer overflow: 3345138 * 1024 cannot be represented in type 'int'

> /* XXX maybe use filesystem fragment size instead of 1024 */
> /* XXX fix actual number of buffers reserved per filesystem. */
> wl->wl_bufcount_max = (buf_nbuf() / 2) * 1024;

Need more work?
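
The arithmetic in the report can be checked directly: 3345138 * 1024 is
3425421312, which exceeds INT_MAX but fits in a 32-bit unsigned value. A
minimal sketch follows; example_nbuf() and bufcount_max_bytes() are
hypothetical names. It also widens to size_t before multiplying, which goes
one step further than the committed change and is only an assumption about
what "more work" might look like:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for buf_nbuf(); chosen so that nbuf / 2 == 3345138. */
static unsigned int
example_nbuf(void)
{
	return 3345138u * 2u;
}

/*
 * Compute (nbuf / 2) * 1024 in size_t. Converting before the multiply
 * keeps the product out of int arithmetic; with a 64-bit size_t the
 * result cannot wrap for any unsigned int input.
 */
static size_t
bufcount_max_bytes(unsigned int nbuf)
{
	return (size_t)(nbuf / 2) * 1024;
}
```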


Revision tags: netbsd-9-base phil-wifi-20190609 isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there doesn't seem to
be currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2; 1.276.4;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
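
The format-string rules listed above can be sketched in userland C with
snprintf(3) standing in for the kernel history macros. format_hist_record()
is a hypothetical helper, not part of kernhist(9):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a pointer and a counter the way the commit message describes:
 * every argument is widened to uintmax_t and every conversion uses the
 * 'j' length modifier. Pointers go through uintptr_t first to avoid a
 * "pointer to integer of a different size" warning.
 */
static int
format_hist_record(char *out, size_t outlen, const void *bp, unsigned int count)
{
	return snprintf(out, outlen, "bp=%#jx count=%ju",
	    (uintmax_t)(uintptr_t)bp, (uintmax_t)count);
}
```

Because every argument has the same width, the format string no longer has to
agree with per-argument sizes, which is the failure mode the commit describes
for non-literal format strings.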


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
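
The commit message does not show the arithmetic, so the following is only one
common way to trade a division for a wider multiplication when comparing a
32-bit random value against a ratio. The function name and parameters are
hypothetical and are not taken from vfs_bio.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide an event with probability num/den from one 32-bit random value r,
 * without dividing: r / 2^32 < num / den  <=>  r * den < num * 2^32.
 * Both products fit in 64 bits because r, num and den are 32-bit values.
 */
static bool
chance_without_division(uint32_t r, uint32_t num, uint32_t den)
{
	assert(den != 0);
	return (uint64_t)r * den < ((uint64_t)num << 32);
}
```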


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and, on the call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
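
The accounting described above can be sketched with a heavily reduced model.
struct master_buf and nested_done() are hypothetical; the field names mirror
the ones in the message, and keeping the first error is an assumption of this
sketch rather than something the message states:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced model of a master buffer. */
struct master_buf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back by nested buffers */
	int b_error;		/* first error reported by a nested buffer, or 0 */
	int done;		/* set when biodone() would have been called */
};

/* Called once per nested buffer that completes 'donebytes' with 'error'. */
static void
nested_done(struct master_buf *mbp, size_t donebytes, int error)
{
	assert(donebytes <= mbp->b_resid);
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;	/* keep the first error */
	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* All children reported: on error, nothing counts as transferred. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;		/* stands in for biodone(mbp) */
	}
}
```

With the reset in place, a consumer that asserts "error implies non-zero
residual" sees a consistent master buffer.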


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi Systems' WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
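
A minimal sketch of the error-latching rule described above, under stated
assumptions: `sketch_buf` and `sketch_nested_done` are hypothetical names and
are not the kernel's nestiobuf code. Keeping the first nonzero error (rather
than the last) is a choice made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the master buffer; not the kernel's struct buf. */
struct sketch_buf {
	int    b_error;    /* 0 means no error recorded so far */
	size_t b_pending;  /* nested buffers still outstanding */
};

/*
 * Called once per completed nested buffer. A nonzero error is latched into
 * the master only if no earlier error is recorded, so a nested buffer that
 * completes successfully later cannot clear it.
 */
static void
sketch_nested_done(struct sketch_buf *master, int nested_error)
{
	assert(master->b_pending > 0);
	if (nested_error != 0 && master->b_error == 0)
		master->b_error = nested_error;
	master->b_pending--;
}
```

Because the master's error field is only ever written when it is still zero,
the outcome no longer depends on the order in which nested buffers complete.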


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.
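
To illustrate why the array index stopped matching the size in kilobytes,
here is a small sketch. It assumes pools sized in powers of two starting at
512 bytes; `SKETCH_MINSHIFT` and `sketch_pool_index` are hypothetical names,
not identifiers from vfs_bio.c.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MINSHIFT 9	/* assumed smallest pool: 1 << 9 = 512 bytes */

/*
 * Map a requested buffer size to the index of the smallest power-of-two
 * pool that can hold it. Index 0 is the 512-byte pool, index 1 the 1k
 * pool, and so on, so the index is not a size in kilobytes.
 */
static unsigned
sketch_pool_index(size_t size)
{
	unsigned idx = 0;
	size_t poolsize = (size_t)1 << SKETCH_MINSHIFT;

	assert(size > 0);
	while (poolsize < size) {
		poolsize <<= 1;
		idx++;
	}
	return idx;
}
```

With this layout a 1k request lands at index 1 and an 8k request at index 4,
which is why naming pools after their index would be misleading.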


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing the low memory condition to see if we should alloc or not doesn't
make sense, since 1) the condition is quite normal and 2) there is a
pool between us and uvm.
- Make the step of allocation possibility a bit more seamless by moving the
origin of the curve from 0 to the lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.
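
The unit mismatch and the min() cap can be sketched as follows. This is an
illustration only: `sketch_release_target`, the 4096-byte page size, and the
choice to do the comparison in bytes are assumptions, not the actual
buf_canrelease() code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the example */

/*
 * Return how many bytes of buffer memory to release, given a deficit
 * reported in pages. The deficit is converted to bytes before comparing,
 * so both sides of the min() use the same unit, and the result is capped
 * at 1/16 of the current buffer memory.
 */
static size_t
sketch_release_target(size_t page_deficit, size_t bufmem_bytes)
{
	size_t deficit_bytes, cap;

	/* Guard the pages-to-bytes multiplication against overflow. */
	assert(page_deficit <= SIZE_MAX / SKETCH_PAGE_SIZE);

	deficit_bytes = page_deficit * SKETCH_PAGE_SIZE;
	cap = bufmem_bytes / 16;
	return deficit_bytes < cap ? deficit_bytes : cap;
}
```

Comparing a raw page count against a byte count, as the earlier code did,
would make the deficit look thousands of times smaller than it is.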

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
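
One way to picture rate-controlled growth is the sketch below. It is an
assumption-laden illustration, not the buf_lotsfree() implementation: the
name `sketch_lotsfree`, the 16-step scale, and the linear fall-off between
the water marks are all choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decide whether a new buffer allocation should be allowed. `rnd` is a
 * caller-supplied random value in [0, 16); passing it in keeps this
 * function deterministic. The chance of success falls from 16/16 at the
 * low water mark to 1/16 just below the high water mark.
 */
static int
sketch_lotsfree(uint64_t bufmem, uint64_t lowater, uint64_t hiwater,
    unsigned rnd)
{
	uint64_t thresh;

	assert(lowater < hiwater && rnd < 16);

	if (bufmem < lowater)
		return 1;	/* always allow growth below the low water mark */
	if (bufmem >= hiwater)
		return 0;	/* never grow at or above the high water mark */

	/* 64-bit arithmetic so the scaled product cannot wrap at 32 bits. */
	thresh = (bufmem - lowater) * 16 / (hiwater - lowater);
	return rnd >= thresh;
}
```

A caller would draw `rnd` from its random source only after the two cheap
water mark comparisons fail to decide the question.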

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
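
The trick can be sketched with the <sys/queue.h> macros. The names below
(`item`, `itemq`, `sketch_remove`, `NQUEUES`) are hypothetical, and the code
relies on the traditional TAILQ layout (`tqh_last`, `tqe_next`), which is
exactly the abstraction break the comment refers to; it is not the bremfree()
source.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct item {
	int id;
	TAILQ_ENTRY(item) link;
};
TAILQ_HEAD(itemq, item);

#define NQUEUES 3

/*
 * Remove `it` from whichever queue it is on. Only the tail element needs
 * its real head (the head's tqh_last must be updated), so the search over
 * heads is skipped for every other element.
 */
static void
sketch_remove(struct itemq queues[NQUEUES], struct item *it)
{
	struct itemq *head = &queues[0];	/* unused unless `it` is a tail */

	if (TAILQ_NEXT(it, link) == NULL) {
		size_t i;

		/* The owning head's tqh_last points at our tqe_next field. */
		for (i = 0; i < NQUEUES; i++) {
			if (queues[i].tqh_last == &it->link.tqe_next) {
				head = &queues[i];
				break;
			}
		}
		assert(i < NQUEUES);	/* must be on one of the queues */
	}
	TAILQ_REMOVE(head, it, link);
}
```

For a non-tail element the traditional TAILQ_REMOVE never touches the head,
so the common case costs no search at all.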


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() re-sizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static arrays to
tables generated dynamically by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- Backward compatibility for loading block/character device
switches by the LKM framework is broken. This is necessary to convert
from block/character device major to device name at runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the device major number list is packed into the kernel and
the LKM framework will refer to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
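
The mapping above swaps the first two arguments for the copy routines,
which is easy to get wrong by hand. A minimal sketch of the four
conversions follows; the compat_* wrapper names are hypothetical and are
used here only to show the argument order, they are not part of this
change.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical wrappers that keep the old BSD argument order (source
 * first) and forward to the ISO C routines (destination first).
 */
static void
compat_bcopy(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	/* bcopy(x, y, z) -> memcpy(y, x, z); regions must not overlap. */
	memcpy(dst, src, len);
}

static void
compat_ovbcopy(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	/* ovbcopy(x, y, z) -> memmove(y, x, z); overlap is allowed. */
	memmove(dst, src, len);
}

static int
compat_bcmp(const void *a, const void *b, size_t len)
{
	/* bcmp(x, y, z) -> memcmp(x, y, z); only zero/non-zero matters. */
	return memcmp(a, b, len) != 0;
}

static void
compat_bzero(void *buf, size_t len)
{
	/* bzero(x, y) -> memset(x, 0, y). */
	memset(buf, 0, len);
}
```

Only bcopy and ovbcopy change argument order; bcmp keeps its order, and
bzero gains an explicit fill value.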


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
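
The fallback described above can be sketched as follows. The types and
names (proc_sketch, curproc_sketch, proc0_sketch, charge_block_read) are
simplified stand-ins chosen for illustration, not the kernel's own
definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for kernel types; field names are illustrative. */
struct rusage_sketch {
	long ru_inblock;
	long ru_oublock;
};

struct proc_sketch {
	struct rusage_sketch p_ru;
};

static struct proc_sketch proc0_sketch;		/* stand-in for process 0 */
static struct proc_sketch *curproc_sketch;	/* may be NULL during a panic */

/* Charge one block read to the current process, or to process 0 if none. */
static void
charge_block_read(void)
{
	struct proc_sketch *p;

	/* Use a local instead of dereferencing the global directly. */
	p = (curproc_sketch != NULL) ? curproc_sketch : &proc0_sketch;
	assert(p != NULL);
	p->p_ru.ru_inblock++;
}
```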


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


Revision tags: isaki-audio2-base pgoyette-compat-20190127 pgoyette-compat-20190118 pgoyette-compat-1226 pgoyette-compat-1126
# 1.278 24-Nov-2018 maxv

Fix kernel pointer leaks in sysctl_dobuf. While here constify argument.

Also memset the buffer, to prevent leaks (even if there don't seem to
be any currently).


Revision tags: pgoyette-compat-1020 pgoyette-compat-0930 pgoyette-compat-0906
# 1.277 29-Aug-2018 hannken

Make sure getnewbuf() runs bawrite() inside fstrans.
Use fstrans_start_nowait() to skip buffers that would block.


Revision tags: pgoyette-compat-0728 phil-wifi-base pgoyette-compat-0625 pgoyette-compat-0521 pgoyette-compat-0502 pgoyette-compat-0422 pgoyette-compat-0415 pgoyette-compat-0407 pgoyette-compat-0330 pgoyette-compat-0322 pgoyette-compat-0315 pgoyette-compat-base tls-maxphys-base-20171202
# 1.276 28-Oct-2017 pgoyette

branches: 1.276.2;
Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.
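
The format-string rules above ('j' length modifier, "%#jx" in place of
"%p", and the double cast for pointers) can be sketched in plain C. This
uses snprintf(3) rather than the kernhist(9) macros, and the helper
names are hypothetical; it only illustrates the casts and specifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a counter with the 'j' length modifier after widening it. */
static int
format_count(char *out, size_t outlen, unsigned int count)
{
	assert(out != NULL);
	return snprintf(out, outlen, "count=%ju", (uintmax_t)count);
}

/*
 * Format a pointer with "%#jx": cast to uintptr_t first, then to
 * uintmax_t, so no pointer is converted to an integer of another size.
 */
static int
format_pointer(char *out, size_t outlen, const void *ptr)
{
	assert(out != NULL);
	return snprintf(out, outlen, "bp=%#jx", (uintmax_t)(uintptr_t)ptr);
}
```

Because every argument is widened to uintmax_t, the argument size no
longer depends on the architecture or on the original type.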


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
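
A common way to trade a division for a multiplication when mapping a
32-bit random value onto a range is shown below. This is a generic
sketch of the technique under that assumption, not the code from this
revision; scale_to_range is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a uniformly distributed 32-bit value r onto [0, n) with one
 * 64-bit multiply and a shift, instead of a division or modulo.
 */
static uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The product r * n is less than n * 2^32, so the top 32 bits always fall
in [0, n).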


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-2-RELEASE netbsd-7-1-2-RELEASE netbsd-7-1-1-RELEASE netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
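
One way the width mismatch described above can lose data is sketched
below with fixed-width types, so the effect does not depend on the host.
The function names are hypothetical and the sketch does not reproduce
the sysctl handler itself.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy pattern: a 64-bit value is staged in a 32-bit temporary. */
static uint64_t
stage_through_narrow(uint64_t requested)
{
	uint32_t tmp = (uint32_t)requested;	/* high 32 bits are lost */

	return tmp;
}

/* Fixed pattern: the temporary has the same width as the variable. */
static uint64_t
stage_through_wide(uint64_t requested)
{
	uint64_t tmp = requested;

	assert(sizeof(tmp) == sizeof(requested));
	return tmp;
}
```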


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read-error does occur in one of the nested buffers, the mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
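
The accounting described above can be sketched with a reduced stand-in
for struct buf. The buf_sketch type and nested_done() are hypothetical
and model only the b_resid/b_error handling; b_done stands in for the
call to biodone(mbp).

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct buf; only the fields discussed above. */
struct buf_sketch {
	size_t	b_bcount;	/* total bytes requested */
	size_t	b_resid;	/* bytes not yet reported back */
	int	b_error;	/* first error seen, 0 if none */
	int	b_done;		/* set where biodone(mbp) would be called */
};

/* A nested buffer covering donebytes of the master has completed. */
static void
nested_done(struct buf_sketch *mbp, size_t donebytes, int error)
{
	assert(donebytes <= mbp->b_resid);
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* All children reported; on error, claim nothing moved. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->b_done = 1;
	}
}
```

With the reset in place, a master buffer that carries an error always
has a non-zero residual when it completes, which is the condition the
physio() assertion checks.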


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers `busy' in the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
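
The propagation rule described in this commit can be sketched in C as
follows. This is only an illustration under assumed names (struct master_io
and nested_io_done() are hypothetical), not the code in vfs_bio.c; keeping
the first error rather than the last is a choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the master buffer's bookkeeping. */
struct master_io {
	int error;	/* 0 until some nested request fails */
	int pending;	/* nested requests still outstanding */
};

/*
 * Called once per completed nested request.  A failure is recorded on
 * the master and is never cleared by a later success, so the final
 * result does not depend on the order in which requests complete.
 */
void
nested_io_done(struct master_io *m, int error)
{
	assert(m->pending > 0);

	if (error != 0 && m->error == 0)
		m->error = error;
	m->pending--;
}
```

With three nested requests where only the second one fails, the master ends
up with that error no matter which request finishes last.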


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.
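
The "doubles the efficiency" claim can be checked with a little arithmetic.
The sketch below assumes power-of-two size classes and uses a hypothetical
helper, pool_item_size(); it does not reproduce the kernel's pool lookup.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: round a request up to the nearest power-of-two
 * size class, where min_class is the smallest pool item size on offer.
 */
size_t
pool_item_size(size_t request, size_t min_class)
{
	assert(request > 0);
	assert(min_class > 0 && (min_class & (min_class - 1)) == 0);

	size_t item = min_class;
	while (item < request)
		item <<= 1;
	return item;
}
```

If the smallest class is 1024 bytes, a 512-byte fragment occupies a
1024-byte item and half of it is wasted; with a 512-byte class the same
fragment fits exactly, so twice as many fit in the same memory.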


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
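
The growth scenario above can be worked through numerically. The helpers
below are hypothetical and only model the bookkeeping; they are not the
canrelease code itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total cache memory after `recycled` out of `nbuf` buffers have been
 * reused for a filesystem whose blocks are new_size instead of old_size.
 */
uint64_t
bufmem_after_recycle(uint64_t nbuf, uint64_t recycled,
    uint64_t old_size, uint64_t new_size)
{
	assert(recycled <= nbuf);
	return (nbuf - recycled) * old_size + recycled * new_size;
}

/* Bytes that must be released to get back down to the high-water mark. */
uint64_t
excess_over_hiwater(uint64_t bufmem, uint64_t hiwater)
{
	return bufmem > hiwater ? bufmem - hiwater : 0;
}
```

For example, 1000 buffers of 1 KB sit exactly at a 1,024,000-byte high-water
mark. Recycling just 100 of them as 16 KB buffers raises usage to 2,560,000
bytes, so 1,536,000 bytes have to be released to return to the mark.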


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
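
The log does not show the expression that overflowed, so the following is
an assumed reconstruction of the kind of bug described: scaling the current
size by 16 before dividing wraps around in 32 bits once the cache reaches
256 MB. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Overflow-prone: 16 * bufmem is truncated to 32 bits before dividing. */
uint32_t
thresh_naive(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater != 0);
	return (uint32_t)(16u * bufmem) / hiwater;
}

/* Safe: scale the divisor down instead of scaling the dividend up. */
uint32_t
thresh_safe(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater >= 16);
	return bufmem / (hiwater / 16);
}
```

With 384 MB in use and a 256 MB high-water mark, the naive form yields 8
out of 16 steps, as if the cache were half full, while the safe form yields
24, correctly showing that usage is above the mark.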


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
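
A rough sketch of the release calculation described in this commit, with
units made explicit. The function name, the byte-based return value and the
page_shift parameter are assumptions for illustration; the actual code may
count in different units.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many bytes of buffer memory to give back.
 * The UVM deficit arrives in pages and is converted to bytes first,
 * then capped at 1/16 of the buffer memory.  Memory sitting on the AGE
 * list is always offered, whatever the deficit is.
 */
uint64_t
release_target(uint64_t deficit_pages, unsigned page_shift,
    uint64_t bufmem, uint64_t age_bytes)
{
	assert(page_shift < 32);

	uint64_t want = deficit_pages << page_shift;	/* pages -> bytes */
	uint64_t cap = bufmem / 16;
	uint64_t n = want < cap ? want : cap;

	return n > age_bytes ? n : age_bytes;
}
```

With 4 KB pages, a deficit of 10 pages against 16 MB of buffer memory asks
for 40,960 bytes; a deficit of 1000 pages against 1 MB is capped at 65,536
bytes.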


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
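
The growth control described above might look roughly like the sketch
below. It is not the kernel's code: cache_may_grow() is a hypothetical name,
the 16-step granularity is an assumption, and the random draw is passed in
as a parameter so that the behaviour is easy to follow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decide whether the cache may grow by one buffer.  `draw` stands in
 * for a random number in [0, 16).  Returns 1 to allocate new memory,
 * 0 to recycle an existing buffer instead.
 */
int
cache_may_grow(uint64_t bufmem, uint64_t hiwater, unsigned draw)
{
	assert(hiwater >= 16);
	assert(draw < 16);

	if (bufmem > hiwater)
		return 0;

	/* Which of the 16 steps between empty and hiwater are we in? */
	uint64_t step = bufmem / (hiwater / 16);

	/* The fuller the cache, the fewer draws succeed. */
	return draw >= step;
}
```

At half of the high-water mark, only draws of 8 and above succeed, so about
half of the requests grow the cache; at the mark itself none do.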


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.276 28-Oct-2017 pgoyette

Update the kernhist(9) kernel history code to address issues identified
in PR kern/52639, as well as some general cleaning-up...

(As proposed on tech-kern@ with additional changes and enhancements.)

Details of changes:

* All history arguments are now stored as uintmax_t values[1], both in
the kernel and in the structures used for exporting the history data
to userland via sysctl(9). This avoids problems on some architectures
where passing a 64-bit (or larger) value to printf(3) can cause it to
process the value as multiple arguments. (This can be particularly
problematic when printf()'s format string is not a literal, since in
that case the compiler cannot know how large each argument should be.)

* Update the data structures used for exporting kernel history data to
include a version number as well as the length of history arguments.

* All [2] existing users of kernhist(9) have had their format strings
updated. Each format specifier now includes an explicit length
modifier 'j' to refer to numeric values of the size of uintmax_t.

* All [2] existing users of kernhist(9) have had their format strings
updated to replace uses of "%p" with "%#jx", and the pointer
arguments are now cast to (uintptr_t) before being subsequently cast
to (uintmax_t). This is needed to avoid compiler warnings about
casting "pointer to integer of a different size."

* All [2] existing users of kernhist(9) have had instances of "%s" or
"%c" format strings replaced with numeric formats; several instances
of mis-match between format string and argument list have been fixed.

* vmstat(1) has been modified to handle the new size of arguments in the
history data as exported by sysctl(9).

* vmstat(1) now provides a warning message if the history requested with
the -u option does not exist (previously, this condition was silently
ignored, with only a single blank line being printed).

* vmstat(1) now checks the version and argument length included in the
data exported via sysctl(9) and exits if they do not match the values
with which vmstat was built.

* The kernhist(9) man-page has been updated to note the additional
requirements imposed on the format strings, along with several other
minor changes and enhancements.

[1] It would have been possible to use an explicit length (for example,
uint64_t) for the history arguments. But that would require another
"rototill" of all the users in the future when we add support for an
architecture that supports a larger size. Also, the printf(3) format
specifiers for explicitly-sized values, such as "%"PRIu64, are much
more verbose (and less aesthetically appealing, IMHO) than simply
using "%ju".

[2] I've tried very hard to find "all [the] existing users of kernhist(9)"
but it is possible that I've missed some of them. I would be glad to
update any stragglers that anyone identifies.


Revision tags: nick-nhusb-base-20170825
# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

branches: 1.273.2;
When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and an ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi Systems' WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
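
A minimal user-space sketch of the behaviour this entry describes, not the
kernel code: once any nested buffer reports an error, the master keeps it,
and a nested buffer that completes successfully afterwards cannot clear it.
The MasterBuf class and nested_done() helper are hypothetical names, and
keeping the first error seen (rather than the last) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MasterBuf:
    """Hypothetical stand-in for the master buffer's bookkeeping."""
    error: int = 0     # 0 means no error has been recorded
    pending: int = 0   # nested buffers still in flight


def nested_done(master: MasterBuf, nested_error: int) -> bool:
    """Record completion of one nested buffer.

    Returns True once every nested buffer has completed.
    """
    # Adopt an error only if none is recorded yet; a later successful
    # nested buffer (nested_error == 0) never resets master.error.
    if nested_error != 0 and master.error == 0:
        master.error = nested_error
    master.pending -= 1
    return master.pending == 0
```

With three nested buffers completing as success, error 5, success, the
master ends up with error 5 regardless of completion order.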


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
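
The scenario above can be modelled with a few lines of arithmetic. This is
a toy model under stated assumptions (a 64 KiB high-water mark, 1 KiB and
8 KiB block sizes, hypothetical helper names), not the canrelease code
itself.

```python
def recycle_for_large_fs(buffer_sizes, large_block):
    """Recycle every cached buffer for the large-block filesystem.

    The buffer count stays the same, but each recycled buffer is resized
    to the larger block size, so total memory grows.
    """
    return [large_block for _ in buffer_sizes]


def must_release(bufmem, hiwater):
    """Bytes to shed right away to get back to the high-water mark."""
    return max(0, bufmem - hiwater)


hiwater = 64 * 1024
cache = [1024] * 64                        # small-block fs fills cache to hiwater
cache = recycle_for_large_fs(cache, 8192)  # large-block fs recycles every buffer
bufmem = sum(cache)                        # eight times the high-water mark
excess = must_release(bufmem, hiwater)
```

In this model the cache ends at 512 KiB against a 64 KiB limit, and the
adjusted check reports the whole 448 KiB excess as releasable at once.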


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal and 2) there is
a pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported that this helps interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
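
One possible reading of the two buf_canrelease() adjustments in this entry,
written as a user-space sketch. Converting the page deficit to bytes before
comparing, and the function and parameter names, are assumptions made for
illustration; the entry itself only gives the min(...) expression and the
AGE-list floor.

```python
def bytes_to_release(deficit_pages, page_size, bufmem, age_bytes):
    """Estimate how much buffer memory to give back.

    deficit_pages: pages UVM is short of its free target
    page_size:     bytes per page
    bufmem:        bytes currently held by the buffer cache
    age_bytes:     bytes held by buffers on the AGE list
    """
    # Cap the request at one sixteenth of the cache so that one burst of
    # page demand does not slam the cache down to its low-water mark.
    wanted = min(deficit_pages * page_size, bufmem // 16)
    # The AGE list acts as a delayed-free list: always offer at least it.
    return max(wanted, age_bytes)
```

With a 16 MiB cache and 4 KiB pages, a deficit of 100 pages yields 400 KiB,
a deficit of 1000 pages is capped at 1 MiB, and 2 MiB on the AGE list
raises either answer to 2 MiB.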


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
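
The growth control mentioned in this entry can be pictured with a small
model. A linear ramp from "always allow" at an empty cache to "never allow"
at the high-water mark is only one simple way to make the chance of a new
allocation fall as the cache grows; the exact curve used by the kernel is
not given here, and growth_allowed() is a hypothetical name.

```python
import random


def growth_allowed(bufmem, hiwater, draw):
    """Decide whether the cache may grow by one more allocation.

    bufmem:  bytes currently held by the cache
    hiwater: high-water mark in bytes
    draw:    a uniform random number in [0.0, 1.0)
    """
    if bufmem >= hiwater:
        return False          # at or above the limit: recycle instead
    fill = bufmem / hiwater   # 0.0 when empty, approaching 1.0 near the limit
    # The fuller the cache, the smaller the share of draws that pass.
    return draw >= fill


# Typical use: feed in a fresh random draw for every allocation attempt.
decision = growth_allowed(32 * 1024, 64 * 1024, random.random())
```

Passing the random draw in as an argument keeps the decision function pure,
which makes the ramp easy to check with fixed values.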


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that is has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
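
The detail that is easy to get wrong in this conversion is the argument order: the b* routines take (src, dst, len), while the mem* routines take (dst, src, len). A small user-space sketch of the same mapping, using hypothetical wrapper names (compat_bcopy, compat_ovbcopy, compat_bcmp, compat_bzero) that do not exist in the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* bcopy(src, dst, len) -> memcpy(dst, src, len); regions must not overlap. */
void compat_bcopy(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	memcpy(dst, src, len);
}

/* ovbcopy(src, dst, len) -> memmove(dst, src, len); overlap is allowed. */
void compat_ovbcopy(const void *src, void *dst, size_t len)
{
	assert(src != NULL && dst != NULL);
	memmove(dst, src, len);
}

/* bcmp(a, b, len) -> memcmp(a, b, len); only "zero or nonzero" is portable. */
int compat_bcmp(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len);
}

/* bzero(buf, len) -> memset(buf, 0, len). */
void compat_bzero(void *buf, size_t len)
{
	memset(buf, 0, len);
}
```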


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
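
The fallback described above can be sketched as follows. All names here (proc_model, proc0_model, curproc_model, count_inblock) are hypothetical stand-ins for the kernel's structures, chosen only to show the NULL check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-process resource usage counters. */
struct proc_model {
	unsigned long ru_inblock;
	unsigned long ru_oublock;
};

struct proc_model proc0_model;		/* stands in for process 0 */
struct proc_model *curproc_model;	/* may be NULL, e.g. during a panic */

/* Charge one block read to the current process, or to process 0 if none. */
void count_inblock(void)
{
	struct proc_model *p =
	    (curproc_model != NULL) ? curproc_model : &proc0_model;

	p->ru_inblock++;
}
```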


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.275 04-Aug-2017 mrg

normalise a BIOHIST log message


Revision tags: perseant-stdc-iso10646-base
# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
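
The failure mode is plain integer truncation. A small sketch, assuming fixed-width types in place of the 'u_long' and 'int' named in the message so that the behaviour is well defined on any platform; the function names (via_narrow_temp, via_wide_temp) are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Buggy pattern: a 64-bit value round-trips through a 32-bit temporary. */
uint64_t via_narrow_temp(uint64_t value)
{
	uint32_t tmp = (uint32_t)value;	/* the high 32 bits are discarded */

	return tmp;
}

/* Fixed pattern: the temporary has the same width as the real variable. */
uint64_t via_wide_temp(uint64_t value)
{
	uint64_t tmp = value;

	return tmp;
}
```

Values that fit in 32 bits pass through both versions unchanged, which is presumably why the problem only showed up with large settings on 64-bit systems.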


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin
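
The message does not spell out the mechanism. One plausible reading, stated here as an assumption, is that a timeout in ticks was computed as hz divided by a constant, which truncates to 0 when hz is small, and that a timeout of 0 ticks means "no timeout" to the sleep primitive. A sketch of the guard under that assumption, with a hypothetical function name:

```c
#include <assert.h>

/*
 * Convert "1/divisor of a second" into clock ticks, never returning 0,
 * since 0 is assumed to mean "sleep without a timeout".
 */
int timeout_ticks(int hz, int divisor)
{
	int timo;

	assert(hz > 0 && divisor > 0);
	timo = hz / divisor;	/* integer division: 0 when hz < divisor */
	return timo > 0 ? timo : 1;
}
```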


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having an ffs+wapbl mount, NFS
mounts and an ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers `busy' in the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
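
The "first error sticks" behaviour described above can be pictured with
the sketch below. It is a minimal model under stated assumptions, not
the kernel code: struct xbuf, nested_done() and the field names are
hypothetical stand-ins for the master/nested buffer pair.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a master buffer; not the kernel's struct buf. */
struct xbuf {
	int	error;		/* 0 while no nested request has failed */
	size_t	resid;		/* bytes still outstanding */
};

/*
 * Account for one finished nested request of `len' bytes that
 * completed with status `error' (0 on success).  The first non-zero
 * error is latched in the master; a later successful piece must not
 * clear it.  Returns 1 once every piece has completed, else 0.
 */
static int
nested_done(struct xbuf *master, size_t len, int error)
{
	assert(len <= master->resid);
	if (error != 0 && master->error == 0)
		master->error = error;
	master->resid -= len;
	return master->resid == 0;
}
```

Because the error is only ever set, never cleared, the result no longer
depends on the order in which the nested pieces happen to complete.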


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.
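
To illustrate why the extra pool matters, here is a rough sketch of
power-of-two size classes starting at 512 bytes. The function name, the
class bounds and the 64k upper limit are assumptions made for the
example, not taken from vfs_bio.c.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MIN_SHIFT	9	/* smallest class: 512 bytes (assumed) */
#define SKETCH_MAX_SHIFT	16	/* largest class: 64k (assumed) */

/*
 * Return the index of the smallest power-of-two size class that can
 * hold `size' bytes.  Index 0 is the 512-byte class, index 1 the 1k
 * class, and so on; the index is therefore not a kilobyte count.
 */
static int
sketch_size_class(size_t size)
{
	size_t cls = (size_t)1 << SKETCH_MIN_SHIFT;
	int idx = 0;

	assert(size > 0 && size <= ((size_t)1 << SKETCH_MAX_SHIFT));
	while (cls < size) {
		cls <<= 1;
		idx++;
	}
	return idx;
}
```

Without a 512-byte class, a single-frag directory on a 4k/512
filesystem would land in the 1k class and waste half of each
allocation, which is where the "doubles the efficiency" figure above
comes from.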


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
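
The kind of 32-bit overflow mentioned above can be reproduced with a
small sketch. The function names and the "sixteenths of the high water
mark" scaling are assumptions chosen for illustration; the actual
expression in buf_lotsfree() may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale current usage to sixteenths of hiwater using 32-bit
 * arithmetic.  16 * bufmem wraps once bufmem reaches 256MB, so a
 * nearly full cache can look almost empty.
 */
static uint32_t
sixteenths_naive(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater != 0);
	return (uint32_t)(16u * bufmem) / hiwater;
}

/*
 * Same scaling, but check bufmem > hiwater first (the cheap early
 * test) and widen to 64 bits before multiplying.
 */
static uint32_t
sixteenths_safe(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater != 0);
	if (bufmem > hiwater)
		return 16;
	return (uint32_t)(((uint64_t)16 * bufmem) / hiwater);
}
```

With bufmem at 768MB and hiwater at 512MB the naive version yields 0,
which a caller would read as "lots free", while the safe version
saturates at 16.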


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
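
One possible reading of the adjusted calculation, with every quantity
kept in bytes so that the pages-versus-bytes mix-up cannot recur, is
sketched below. The function name, the assumed 4k page size and the way
the AGE-list floor is combined with the min() are assumptions for the
example, not the committed code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed page size */

/*
 * How many bytes of buffer memory to release:
 *   - convert the UVM page deficit to bytes first,
 *   - never give back more than 1/16 of the buffer memory at once,
 *   - but always offer at least what sits on the AGE list.
 */
static uint64_t
sketch_release_target(uint64_t page_deficit, uint64_t bufmem_bytes,
    uint64_t age_bytes)
{
	uint64_t want = page_deficit * SKETCH_PAGE_SIZE;
	uint64_t cap = bufmem_bytes / 16;
	uint64_t target = want < cap ? want : cap;

	assert(age_bytes <= bufmem_bytes);
	if (target < age_bytes)
		target = age_bytes;
	return target;
}
```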


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
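
The growth control described above can be sketched as follows. This is
an illustration only: the name sketch_may_grow(), the 0..15 draw and the
linear fall-off are assumptions, and the draw is passed in as an
argument so that the example stays deterministic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decide whether the cache may grow by a fresh allocation.  `draw' is
 * a uniformly random value in 0..15 supplied by the caller.  The
 * fuller the cache, the higher the threshold the draw must reach, so
 * the chance of growing falls off as bufmem approaches hiwater and is
 * zero at or above it.
 */
static int
sketch_may_grow(uint64_t bufmem, uint64_t hiwater, unsigned draw)
{
	uint64_t thresh;

	assert(hiwater != 0 && draw < 16);
	if (bufmem >= hiwater)
		return 0;
	thresh = (16 * bufmem) / hiwater;	/* 0..15 */
	return draw >= thresh;
}
```

At half of the high water mark the threshold is 8, so half of the
possible draws allow growth; callers that are refused would recycle an
existing buffer instead.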


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
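
A sketch of what such a removal can look like with <sys/queue.h> is
given below. It assumes the classic TAILQ layout (tqh_first/tqh_last,
tqe_next/tqe_prev); struct sbuf, NQUEUES and sketch_remfree() are
hypothetical names, and the real bremfree() may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct sbuf {
	TAILQ_ENTRY(sbuf) link;
};
TAILQ_HEAD(sbufq, sbuf);

#define NQUEUES	3

/*
 * Unlink bp from whichever queue it is on.  Only the last element of
 * a queue needs its head (to repair tqh_last), and that head can be
 * recognised because its tqh_last points at bp's tqe_next field.
 * Returns the index of the head that had to be looked up, or -1 if
 * no head was needed.
 */
static int
sketch_remfree(struct sbufq queues[NQUEUES], struct sbuf *bp)
{
	struct sbufq *dp;

	if (TAILQ_NEXT(bp, link) != NULL) {
		/* Interior or first element: neighbours suffice. */
		bp->link.tqe_next->link.tqe_prev = bp->link.tqe_prev;
		*bp->link.tqe_prev = bp->link.tqe_next;
		return -1;
	}
	for (dp = queues; dp < &queues[NQUEUES]; dp++)
		if (dp->tqh_last == &bp->link.tqe_next)
			break;
	assert(dp < &queues[NQUEUES]);	/* bp must be on some queue */
	TAILQ_REMOVE(dp, bp, link);
	return (int)(dp - queues);
}
```

The search over the heads runs only for a tail element and is bounded
by the small, fixed number of queues, which is what makes peeking
behind the abstraction worthwhile.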


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- Backward compatibility for loading block/character device
switches via the LKM framework is broken. This is necessary to convert
from block/character device major to device name at runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel and
the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.
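
The wait condition described above can be sketched as a small predicate.
This is only an illustration, not the committed code: the flag values are
made up (the real ones live in <sys/buf.h>) and must_sleep_for_io() is a
hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real definitions are in <sys/buf.h>. */
#define B_DONE   0x01u	/* I/O on this buffer has completed */
#define B_DELWRI 0x02u	/* delayed write: contents valid, not yet on disk */

/*
 * Return true if a biowait()-style routine would have to sleep.
 * Either B_DONE or B_DELWRI means there is no I/O in flight to wait for.
 */
bool
must_sleep_for_io(unsigned int b_flags)
{
	return (b_flags & (B_DONE | B_DELWRI)) == 0;
}
```

With this shape, a buffer that failed its bounds check still carries its
error flag back to the caller instead of being mistaken for a cache hit,
because the "already done" decision is made in one place only.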


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
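
The easy mistake in this conversion is the argument order: bcopy() and
ovbcopy() take the source first, while memcpy() and memmove() take the
destination first. A minimal sketch of the mapping as wrappers follows;
the compat_* names are hypothetical and chosen only to avoid clashing
with libc.

```c
#include <assert.h>
#include <string.h>

/* bcopy(src, dst, len): memcpy() takes the destination first. */
void
compat_bcopy(const void *src, void *dst, size_t len)
{
	memcpy(dst, src, len);
}

/* ovbcopy(src, dst, len): regions may overlap, so use memmove(). */
void
compat_ovbcopy(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* bcmp(a, b, len): callers should only compare the result with zero. */
int
compat_bcmp(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len);
}

/* bzero(buf, len): memset() takes the fill byte before the length. */
void
compat_bzero(void *buf, size_t len)
{
	memset(buf, 0, len);
}
```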


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
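
A rough sketch of the fallback described above, under simplifying
assumptions: struct proc here is a hypothetical two-field stand-in (the
kernel keeps these counters deeper inside the process statistics), and
io_accounting_proc() and charge_block_read() are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's process structure. */
struct proc {
	long ru_inblock;	/* block input operations charged */
	long ru_oublock;	/* block output operations charged */
};

struct proc proc0;		/* process 0, always valid */
struct proc *curproc;		/* may be NULL, e.g. during a panic */

/* Pick the process to charge block I/O to; never returns NULL. */
struct proc *
io_accounting_proc(void)
{
	return (curproc != NULL) ? curproc : &proc0;
}

/* Charge one block read through the local, not through curproc. */
void
charge_block_read(void)
{
	struct proc *p = io_accounting_proc();

	p->ru_inblock++;
}
```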


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.274 08-Jun-2017 chs

move some buffer cache internals declarations from buf.h to vfs_bio.c.
this is needed to avoid name conflicts with ZFS and also
makes it clearer that other code shouldn't be messing with these.
remove the LFS debug code that poked around in bufqueues and
remove the BQ_EMPTY bufqueue since nothing uses it anymore.
provide a function to let LFS and wapbl read the value of nbuf for now.


Revision tags: netbsd-8-base
# 1.273 25-May-2017 pgoyette

When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
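
The log does not show the actual computation, so the following is only a
generic sketch of the multiply-instead-of-divide idea: a 32-bit random
value is mapped onto a range with one widening multiplication and a
shift. scale_random() is a hypothetical name, and r stands for whatever a
32-bit generator such as cprng_fast32() returns.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 32-bit value r onto [0, n) without a division or modulo.
 * (r * n) / 2^32 is always less than n because r < 2^32.
 */
uint32_t
scale_random(uint32_t r, uint32_t n)
{
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The product is formed in 64 bits so it cannot overflow; the result is
slightly biased when n does not divide 2^32, which is usually acceptable
for heuristics.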


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
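
The completion logic described above can be sketched roughly as follows.
This is an illustration under assumptions, not the committed code: struct
master_buf is a hypothetical stripped-down buffer that borrows the field
names from the log, and master_done() stands in for biodone(mbp) while
checking the invariant that physio() is said to assert.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down master buffer. */
struct master_buf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int    b_error;		/* first error seen, 0 if none */
	int    done;		/* set once the biodone() stand-in has run */
};

/* Stand-in for biodone(mbp): an error must come with a residual. */
void
master_done(struct master_buf *mbp)
{
	assert(mbp->b_error == 0 || mbp->b_resid != 0);
	mbp->done = 1;
}

/* Called once for each nested buffer when its I/O finishes. */
void
nested_done(struct master_buf *mbp, size_t donebytes, int error)
{
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;
	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* Every nested buffer has reported back. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;	/* all or nothing */
		master_done(mbp);
	}
}
```

Without the reset of b_resid, the second call in a sequence such as
nested_done(&m, 4096, 0) followed by nested_done(&m, 4096, 5) would trip
the assertion in master_done().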


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic, since a later-scheduled nested buffer
could clear the error again now that there is no B_ERROR flag anymore. It
would also discard the error the nested buffer returned.
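
The propagation rule described above can be sketched in user-space C. This
is only an illustration under stated assumptions: struct xbuf and
nested_done() are hypothetical stand-ins, not the kernel's struct buf or
nestiobuf_iodone(). The first error reported by any nested buffer is latched
in the master, and a later successful nested buffer cannot clear it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the kernel's struct buf. */
struct xbuf {
	int	b_error;	/* 0, or the first errno that was reported */
	size_t	b_resid;	/* bytes of the master request still pending */
};

/*
 * Account for one completed nested buffer.  A non-zero child error is
 * copied to the master only if the master has no error yet, so the
 * result does not depend on the order in which children complete.
 */
static void
nested_done(struct xbuf *master, int child_error, size_t child_len)
{
	assert(child_len <= master->b_resid);

	if (child_error != 0 && master->b_error == 0)
		master->b_error = child_error;
	master->b_resid -= child_len;
}
```

With three 1024-byte children of which only the second fails with EIO, the
master ends up with b_error == EIO regardless of completion order.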


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.
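
A minimal sketch of how a request size could be mapped to a pool, assuming
power-of-two size classes starting at 512 bytes. The names pool_index() and
pool_item_size() and both shift constants are assumptions made for this
illustration and are not taken from vfs_bio.c.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MIN_SHIFT	9	/* assumed smallest pool: 512 bytes */
#define POOL_MAX_SHIFT	16	/* assumed largest pool: 64 KiB */

/* Return the index of the smallest power-of-two pool that fits `size'. */
static unsigned
pool_index(size_t size)
{
	unsigned idx = 0;

	assert(size > 0 && size <= ((size_t)1 << POOL_MAX_SHIFT));

	/* Count how many doublings of the minimum size are needed. */
	size = (size - 1) >> POOL_MIN_SHIFT;
	while (size != 0) {
		size >>= 1;
		idx++;
	}
	return idx;
}

/* Item size served by the pool at index `idx'. */
static size_t
pool_item_size(unsigned idx)
{
	return (size_t)1 << (POOL_MIN_SHIFT + idx);
}
```

Under these assumptions a 512-byte fragment lands in pool 0 and occupies 512
bytes. If the smallest pool were 1k, the same fragment would occupy twice as
much, which is the doubling in efficiency the message refers to.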


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
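
The adjusted behaviour can be sketched as follows. This is a simplified
model, not the kernel's canrelease function: canrelease_sketch() and its
parameters are hypothetical, all values are in bytes, and the policy between
the two marks is an assumption. The point it shows is the early return that
hands back the whole excess once the cache is above the high-water mark.

```c
#include <assert.h>

/*
 * Return how many bytes the cache should give back.
 * `wanted' is what the caller would like to reclaim.
 */
static unsigned long
canrelease_sketch(unsigned long bufmem, unsigned long lowater,
    unsigned long hiwater, unsigned long wanted)
{
	unsigned long room;

	assert(lowater < hiwater);

	/* Below the low-water mark: release nothing. */
	if (bufmem < lowater)
		return 0;

	/* Above the high-water mark: shed the whole excess at once. */
	if (bufmem > hiwater)
		return bufmem - hiwater;

	/* In between: honour the request, but never dip below lowater. */
	room = bufmem - lowater;
	return wanted < room ? wanted : room;
}
```

In the two-filesystem scenario above, a cache that has grown to 120 units
with a high-water mark of 100 is asked to release 20 even when the caller
requested nothing.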


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported that this helps interactive performance when there is heavy
disk activity (PR#27057).
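
A sketch of an allocation test whose curve starts at the low-water mark.
lotsfree_sketch() and its parameters are assumptions for illustration. The
`guess' argument stands for a uniformly random value in 0..15 that the
caller would draw, and is passed in here so that the function stays
deterministic.

```c
#include <assert.h>

/*
 * Return 1 if a new buffer allocation should be allowed, 0 otherwise.
 * The chance of returning 1 falls linearly from certain at lowater to
 * impossible at hiwater.
 */
static int
lotsfree_sketch(unsigned long bufmem, unsigned long lowater,
    unsigned long hiwater, unsigned guess)
{
	assert(lowater < hiwater && guess < 16);

	/* Always allocate below the low-water mark. */
	if (bufmem < lowater)
		return 1;

	/* Never allocate above the high-water mark. */
	if (bufmem > hiwater)
		return 0;

	/* Divide before multiplying so the product cannot overflow. */
	return (hiwater - lowater) / 16 * guess >= bufmem - lowater;
}
```

With lowater = 1000 and hiwater = 2600, a cache holding 1800 bytes is
halfway up the curve: guesses 8 to 15 allow the allocation and guesses 0 to
7 refuse it.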


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.
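
The idea of tracking the amount on each queue can be sketched like this.
struct bq_sketch and its helpers are hypothetical names. The running totals
are updated whenever a buffer enters or leaves a queue, so asking how much
is on the queue costs O(1) instead of a walk over every buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-queue accounting kept next to the queue head. */
struct bq_sketch {
	size_t	bq_bytes;	/* total size of the buffers on the queue */
	size_t	bq_count;	/* number of buffers on the queue */
};

/* Call when a buffer of `bufsize' bytes is put on the queue. */
static void
bq_enter(struct bq_sketch *q, size_t bufsize)
{
	q->bq_bytes += bufsize;
	q->bq_count++;
}

/* Call when a buffer of `bufsize' bytes is taken off the queue. */
static void
bq_leave(struct bq_sketch *q, size_t bufsize)
{
	assert(q->bq_count > 0 && q->bq_bytes >= bufsize);
	q->bq_bytes -= bufsize;
	q->bq_count--;
}
```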


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
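
An illustration of how a 32-bit overflow can make such a test return true
above the high-water mark, and how an early check plus dividing first avoids
it. Neither function is the code from before or after this revision; both
are hypothetical, and `guess' stands for a random value in 0..15 supplied by
the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Scales by multiplication: either product may wrap modulo 2^32. */
static int
lotsfree_wrapping(uint32_t bufmem, uint32_t hiwater, uint32_t guess)
{
	uint32_t lhs = bufmem * 16u;	/* wraps once bufmem >= 256 MiB */
	uint32_t rhs = hiwater * guess;	/* may wrap as well */

	assert(guess < 16);
	return lhs <= rhs;
}

/* Checks the high-water mark first, then divides before multiplying. */
static int
lotsfree_checked(uint32_t bufmem, uint32_t hiwater, uint32_t guess)
{
	assert(guess < 16);

	if (bufmem > hiwater)
		return 0;	/* also skips the arithmetic entirely */
	return hiwater / 16u * guess >= bufmem;
}
```

With bufmem = 0x50000000 and hiwater = 0x40000000, the wrapping variant sees
bufmem * 16 wrap to 0 and reports free space, while the checked variant
returns 0.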


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
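
The bookkeeping described in this message can be sketched as below.
release_target() is a hypothetical helper and the 4096-byte page size is an
assumption. It converts the page deficit to bytes before comparing, caps the
result at one sixteenth of the buffer memory, and never returns less than
what sits on the AGE list.

```c
#include <assert.h>

#define SK_PAGE_SIZE	4096UL	/* assumed page size for the sketch */

/* Return the number of bytes the buffer cache should release. */
static unsigned long
release_target(long page_deficit, unsigned long bufmem,
    unsigned long age_bytes)
{
	unsigned long want;

	assert(age_bytes <= bufmem);

	/* No page shortage: only the AGE list is worth giving back. */
	if (page_deficit <= 0)
		return age_bytes;

	/* The deficit is in pages; everything else here is in bytes. */
	want = (unsigned long)page_deficit * SK_PAGE_SIZE;

	/* min(page deficit, buffer memory / 16) */
	if (want > bufmem / 16)
		want = bufmem / 16;

	/* Always return at least the memory held by the AGE list. */
	return want > age_bytes ? want : age_bytes;
}
```

With 1 MiB of buffer memory, a deficit of 4 pages yields 16384 bytes, a
deficit of 1000 pages is capped at 65536 bytes, and 100000 bytes on the AGE
list override either figure.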


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
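
Why the abstraction has to be broken can be seen in a hand-rolled tail
queue. This sketch does not use <sys/queue.h> and all names in it are
hypothetical. A non-tail element can be unlinked with no reference to its
head. A tail element must also update the head's last-pointer, so the owning
head is found by checking which head's last-pointer refers to the element's
next field.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node	 *next;
	struct node	**prevp;	/* address of the pointer pointing at us */
};

struct queue {
	struct node	 *first;
	struct node	**lastp;	/* address of the final next pointer */
};

static void
queue_init(struct queue *q)
{
	q->first = NULL;
	q->lastp = &q->first;
}

static void
queue_insert_tail(struct queue *q, struct node *n)
{
	assert(n != NULL);
	n->next = NULL;
	n->prevp = q->lastp;
	*q->lastp = n;
	q->lastp = &n->next;
}

/*
 * Unlink `n' from whichever of the `nq' queues it is on.
 * Returns 0 on success, -1 if `n' is a tail that no queue owns.
 */
static int
remove_any(struct queue *qs, size_t nq, struct node *n)
{
	size_t i;

	if (n->next != NULL) {
		/* Not a tail: the neighbours are all we need. */
		n->next->prevp = n->prevp;
	} else {
		/* Tail: find the head whose last-pointer refers to us. */
		for (i = 0; i < nq; i++)
			if (qs[i].lastp == &n->next)
				break;
		if (i == nq)
			return -1;
		qs[i].lastp = n->prevp;
	}
	*n->prevp = n->next;
	return 0;
}
```

Only the tail case scans the queue heads, so the cost is bounded by the
number of free lists rather than by the number of buffers.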


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel,
and the LKM framework refers to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.
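
The flag logic described in rev. 1.80 can be modelled roughly as follows.
This is a userland sketch only, not the kernel code: the flag values and
the helper names biowait_must_sleep() and buffer_failed() are assumptions
made for illustration.

```c
#include <assert.h>

/* Hypothetical flag values for illustration; the kernel's values differ. */
#define B_DONE   0x01	/* i/o has completed */
#define B_DELWRI 0x02	/* delayed write: data is valid but not yet on disk */
#define B_ERROR  0x04	/* i/o failed, e.g. a read past the end of the device */

/*
 * biowait() only needs to sleep when the buffer is neither done nor a
 * delayed write; in both of those cases no i/o is outstanding.
 */
static int
biowait_must_sleep(int b_flags)
{
	/* This model only knows about the three flags above. */
	assert((b_flags & ~(B_DONE | B_DELWRI | B_ERROR)) == 0);
	return (b_flags & (B_DONE | B_DELWRI)) == 0;
}

/* The error is reported from B_ERROR, whether or not B_DONE is set. */
static int
buffer_failed(int b_flags)
{
	return (b_flags & B_ERROR) != 0;
}
```

In this model a buffer that failed its bounds check ends up with both
B_DONE and B_ERROR set: biowait() does not sleep, and the error is still
reported rather than being mistaken for a cache hit.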


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
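
The part of this mapping that is easiest to get wrong is the argument
order: the b*() routines take the source first, while the ISO C routines
take the destination first. A small userland sketch, in which the old_*()
wrapper names are hypothetical and exist only to show the mapping:

```c
#include <assert.h>
#include <string.h>

/* bcopy(src, dst, len) -> memcpy(dst, src, len): source and destination swap. */
static void
old_copy(const void *src, void *dst, size_t len)
{
	assert(len == 0 || (src != NULL && dst != NULL));
	memcpy(dst, src, len);
}

/* ovbcopy(src, dst, len) -> memmove(dst, src, len): safe when regions overlap. */
static void
old_overlap_copy(const void *src, void *dst, size_t len)
{
	assert(len == 0 || (src != NULL && dst != NULL));
	memmove(dst, src, len);
}

/* bzero(p, len) -> memset(p, 0, len): the fill value becomes explicit. */
static void
old_zero(void *p, size_t len)
{
	assert(len == 0 || p != NULL);
	memset(p, 0, len);
}
```

bcmp() and memcmp() take their arguments in the same order, so that
conversion is a plain rename.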


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite out, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.273 25-May-2017 pgoyette

When logging a history record for biowait(), include the return address
as a parameter, to identify to which of the many calls to biowait() the
record refers.


Revision tags: prg-localcount2-base3 prg-localcount2-base2 prg-localcount2-base1 prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid at zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
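
The accounting described in rev. 1.226 can be sketched as a userland model.
The struct and the nested_done() helper below are assumptions for
illustration, not the kernel's nestiobuf code; only the field names
b_bcount, b_resid and b_error come from the log message.

```c
#include <assert.h>

/* Hypothetical model of the master buffer fields named above. */
struct master_buf {
	int b_bcount;	/* total bytes requested */
	int b_resid;	/* bytes not yet reported back by nested buffers */
	int b_error;	/* error reported by a nested buffer, 0 if none */
};

/*
 * Account for one nested buffer reporting back.  Returns 1 when the
 * master buffer is complete, which is where the kernel would call
 * biodone(mbp); returns 0 while nested buffers are still outstanding.
 */
static int
nested_done(struct master_buf *mbp, int donebytes, int error)
{
	assert(mbp->b_resid >= donebytes);
	mbp->b_resid -= donebytes;
	if (error != 0)
		mbp->b_error = error;
	if (mbp->b_resid != 0)
		return 0;
	/* On error, report the whole transfer as not done. */
	if (mbp->b_error != 0)
		mbp->b_resid = mbp->b_bcount;
	return 1;
}
```

With this rule a failed transfer completes with both an error and a
non-zero residual, which is the combination the physio() assertion expects.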


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy' in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
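
A minimal sketch of the "sticky error" behaviour described above, assuming
a simplified master buffer. The struct and the function name are
hypothetical and are not the kernel's struct buf or nestiobuf_iodone();
keeping the first non-zero error (rather than the last) is also an
assumption.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a master buffer; not struct buf. */
struct master_buf {
	int	error;	/* first error reported by a nested buffer, 0 if none */
	size_t	resid;	/* bytes not yet accounted for by nested buffers */
};

/*
 * Called once per completed nested buffer.  A non-zero child_error is
 * latched into the master and is never cleared by a later, successful
 * nested buffer.
 */
void
nested_done_sketch(struct master_buf *mbp, int child_error, size_t donebytes)
{
	if (child_error != 0 && mbp->error == 0)
		mbp->error = child_error;
	mbp->resid -= donebytes;
}
```

With this shape the completion order of the nested buffers no longer
decides whether the master ends up with an error.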


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.
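
A rough sketch of how a request size could be mapped to one of several
power-of-two buffer pools once the smallest pool is 512 bytes. Only the
512-byte minimum comes from the log; the function names and the
power-of-two bucketing are assumptions made for illustration.

```c
#include <stddef.h>

#define MIN_POOL_SHIFT_SKETCH	9	/* smallest pool: 1 << 9 = 512 bytes */

/* Index of the smallest pool whose item size is >= size. */
unsigned
pool_index_sketch(size_t size)
{
	unsigned idx = 0;
	size_t cap = (size_t)1 << MIN_POOL_SHIFT_SKETCH;

	while (cap < size) {
		cap <<= 1;
		idx++;
	}
	return idx;
}

/* Item size served by pool idx. */
size_t
pool_item_size_sketch(unsigned idx)
{
	return (size_t)1 << (MIN_POOL_SHIFT_SKETCH + idx);
}
```

Under these assumptions a single-fragment directory on a 4k/512 filesystem
would be served from index 0 rather than from a 1k item, which is the
saving the message refers to.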


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
of the nested iobuf extent was transferred.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
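
One way to picture the adjustment, as a sketch only: whatever amount the
caller would otherwise release, also release any excess over the
high-water mark. The function and parameter names are hypothetical and all
quantities are assumed to be in bytes; the real canrelease logic is not
shown in this log.

```c
#include <stddef.h>

/*
 * pressure: bytes the caller would release anyway (assumed input).
 * Returns at least the amount by which bufmem exceeds hiwater.
 */
size_t
release_target_sketch(size_t bufmem, size_t hiwater, size_t pressure)
{
	size_t over = (bufmem > hiwater) ? bufmem - hiwater : 0;

	return (pressure > over) ? pressure : over;
}
```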


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
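
The log does not show the expression that overflowed. The sketch below
only illustrates the general class of bug, assuming a byte count is scaled
by 256 before being divided by the water-mark range; both function names
are hypothetical.

```c
#include <stdint.h>

/* Scaling done in 32 bits: the product wraps once used >= 16 MiB. */
uint32_t
scaled32_sketch(uint32_t bufmem, uint32_t lowater, uint32_t hiwater)
{
	uint32_t used = bufmem - lowater;
	uint32_t range = hiwater - lowater;

	return (uint32_t)(used * 256u) / range;
}

/* Same computation with a 64-bit intermediate: no wraparound. */
uint32_t
scaled64_sketch(uint32_t bufmem, uint32_t lowater, uint32_t hiwater)
{
	uint64_t used = bufmem - lowater;
	uint64_t range = hiwater - lowater;

	return (uint32_t)(used * 256u / range);
}
```

With 32 MiB in use and a 64 MiB range, the 32-bit version yields 0 instead
of 128, so a nearly half-full cache would look empty to a threshold test.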


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
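
A sketch of the release calculation as described in this message, with the
unit conversion made explicit. The function name, the fixed page size, and
the choice to express everything in bytes are assumptions; only the
min(page deficit, buffer memory / 16) rule and the "at least the AGE list"
floor come from the text.

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH	4096u	/* assumed page size */

size_t
canrelease_sketch(size_t page_deficit, size_t bufmem, size_t age_bytes)
{
	size_t want = page_deficit * PAGE_SIZE_SKETCH;	/* pages -> bytes */
	size_t cap = bufmem / 16;
	size_t n = (want < cap) ? want : cap;

	/* Memory held by the AGE list is always eligible for release. */
	return (n > age_bytes) ? n : age_bytes;
}
```

Converting the deficit to bytes before comparing is the step whose absence
caused the pages-versus-bytes mismatch noted at the top of the message.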


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
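
The growth control can be sketched as a randomized admission test. This is
an illustration under assumptions, not the code from this revision: the
function name is hypothetical, the random value is passed in so the
behaviour is reproducible, and the chance of growth falls linearly from
zero bytes up to the high-water mark.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * rnd: caller-supplied random value; only the low 8 bits are used.
 * Returns true if the cache may allocate new memory for a buffer.
 */
bool
may_grow_sketch(uint64_t bufmem, uint64_t lowater, uint64_t hiwater,
    unsigned rnd)
{
	uint64_t thresh;

	if (bufmem < lowater)
		return true;		/* always grow below the low mark */
	if (bufmem >= hiwater)
		return false;		/* never grow at or above the high mark */
	thresh = bufmem * 256 / hiwater;	/* 0..255 since bufmem < hiwater */
	return (rnd & 0xff) >= thresh;
}
```

The fuller the cache, the higher the threshold and the less often a
request is allowed to grow it instead of recycling an existing buffer.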


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
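
A sketch of the idea, using a hand-rolled tail queue rather than the
<sys/queue.h> macros. All names are hypothetical. An element in the middle
of a queue can be unlinked through its own pointers; only the last element
needs the head, which is found by checking which head's tail pointer
refers to it.

```c
#include <stddef.h>

struct node_sketch {
	struct node_sketch *next;
	struct node_sketch **prevp;	/* address of the pointer to this node */
};

struct queue_sketch {
	struct node_sketch *first;
	struct node_sketch **lastp;	/* address of the final next pointer */
};

void
queue_init_sketch(struct queue_sketch *q)
{
	q->first = NULL;
	q->lastp = &q->first;
}

void
queue_insert_tail_sketch(struct queue_sketch *q, struct node_sketch *n)
{
	n->next = NULL;
	n->prevp = q->lastp;
	*q->lastp = n;
	q->lastp = &n->next;
}

/* Remove n without being told which of the nq queues it is on. */
void
remove_unknown_queue_sketch(struct queue_sketch *qs, size_t nq,
    struct node_sketch *n)
{
	size_t i;

	if (n->next != NULL) {
		n->next->prevp = n->prevp;
	} else {
		/* n is last on its queue: find the head that ends at n. */
		for (i = 0; i < nq; i++) {
			if (qs[i].lastp == &n->next) {
				qs[i].lastp = n->prevp;
				break;
			}
		}
	}
	*n->prevp = n->next;
}
```

The search over heads happens only for a tail element, so the common case
stays constant-time.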


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed up in port dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.
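
The rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the kernel code: the X_B_* flag values, must_sleep_for_io() and wait_result() are hypothetical, and the real B_* constants live in <sys/buf.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag values only; not the real B_* constants. */
#define X_B_DONE	0x01u
#define X_B_DELWRI	0x02u
#define X_B_ERROR	0x04u

/* A waiter only needs to sleep if neither B_DONE nor B_DELWRI is set. */
static bool
must_sleep_for_io(unsigned flags)
{
	return (flags & (X_B_DONE | X_B_DELWRI)) == 0;
}

/*
 * Once no sleep is needed, the error state is still reported, so a read
 * past the end of the device (B_ERROR set, then biodone()) fails instead
 * of being mistaken for a cache hit.
 */
static int
wait_result(unsigned flags, int b_error)
{
	assert(!must_sleep_for_io(flags));
	if (flags & X_B_ERROR)
		return b_error != 0 ? b_error : EIO;
	return 0;
}
```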


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
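
The mapping above swaps the first two arguments for the copy routines, which is easy to get wrong by hand. A small sketch of the new-style calls, with the old form noted in comments; the helper names copy_block(), shift_left() and clear_block() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <string.h>

/* Old: bcopy(src, dst, len).  New: destination comes first. */
static void
copy_block(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, len);
}

/* Old: ovbcopy(buf + by, buf, len - by).  Overlapping regions need memmove. */
static void
shift_left(char *buf, size_t len, size_t by)
{
	assert(by <= len);
	memmove(buf, buf + by, len - by);
}

/* Old: bzero(p, len).  New: memset with an explicit fill byte of 0. */
static void
clear_block(void *p, size_t len)
{
	memset(p, 0, len);
}
```

Note that bcmp() and memcmp() keep the same argument order, so only the copy calls need their operands exchanged.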


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


Revision tags: prg-localcount2-base pgoyette-localcount-20170426 bouyer-socketcan-base1 jdolecek-ncq-base
# 1.272 05-Apr-2017 jdolecek

expose disk device FUA/DPO support via DIOCGCACHE, and allow the flags
to be set for I/O; implement support in sd(4) and nvme(4)

discussed on tech-kern


# 1.271 21-Mar-2017 skrll

Use brelsel while the bufcache_lock is held rather than dropping it
and re-taking / dropping it in brelse


Revision tags: pgoyette-localcount-20170320
# 1.270 18-Mar-2017 riastradh

Nix trailing whitespace.


Revision tags: nick-nhusb-base-20170204
# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

branches: 1.267.2;
Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
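
One common way to trade a division for a multiplication when mapping a 32-bit random value onto a range is shown below. This is a sketch of the general technique only; the commit does not spell out the expression it uses, and scale_to_range() is a hypothetical name. In the kernel the random input would come from cprng_fast32(); here it is simply a parameter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a uniformly distributed 32-bit value r onto [0, n) using one
 * widening multiply and a shift instead of a modulo or division.
 */
static uint32_t
scale_to_range(uint32_t r, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The result is always strictly less than n because r is at most 2^32 - 1.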


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-1-RELEASE netbsd-7-1-RC2 netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops kern.maxvnodes from being set so high that it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid being zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
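
The completion rule described in this message can be sketched as follows. This is a simplified model under stated assumptions, not the kernel's nestiobuf code: struct master_buf and nested_done() are hypothetical, and locking and the actual biodone() call are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of the master buffer that matter. */
struct master_buf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back by children */
	int b_error;		/* error recorded from a nested buffer, or 0 */
};

/*
 * Account for one finished nested buffer.  Returns true when the last
 * child has reported in, i.e. when the caller would call biodone(mbp).
 */
static bool
nested_done(struct master_buf *mbp, size_t donebytes, int error)
{
	assert(donebytes <= mbp->b_resid);

	if (error != 0)
		mbp->b_error = error;
	mbp->b_resid -= donebytes;
	if (mbp->b_resid != 0)
		return false;

	/* On error, report the whole transfer as not done. */
	if (mbp->b_error != 0)
		mbp->b_resid = mbp->b_bcount;
	return true;
}
```

With this rule, a master buffer that finishes with an error always carries a non-zero residual, which is the condition the physio() assertion expects.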


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.
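
The propagation rule described above can be sketched as follows. This is a
minimal illustration, assuming a simplified stand-in for struct buf; toy_buf
and toy_nested_done are hypothetical names, not the kernel's nestiobuf code.
Once any nested buffer reports an error, the master keeps it, and a later
successful nested buffer cannot clear it.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct buf. */
struct toy_buf {
	int    b_error;	/* 0 means no error seen so far */
	size_t b_resid;	/* bytes of the master request still outstanding */
};

/*
 * Account for one finished nested buffer.  A nonzero error is latched in
 * the master and is never overwritten by a later success.  Returns 1 once
 * the whole master request has been accounted for.
 */
static int
toy_nested_done(struct toy_buf *master, size_t donebytes, int error)
{
	if (error != 0 && master->b_error == 0)
		master->b_error = error;	/* keep the first error seen */
	master->b_resid -= donebytes;
	return master->b_resid == 0;
}
```

Keeping the first error is an assumption of this sketch; the point is only
that completion order no longer decides whether the master sees an error.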


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
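
The adjustment described above can be sketched as follows. This is a toy
model under stated assumptions, not the kernel's buf_canrelease():
toy_canrelease and its parameters are hypothetical, and all quantities are
in bytes. Whatever amount the caller wanted to release, the result is never
less than the current excess over the high-water mark.

```c
#include <stddef.h>

/*
 * Return how many bytes may be released.  `wanted` is what the caller
 * would release anyway; if the cache has grown past `hiwater`, release at
 * least enough to come back down to it.
 */
static size_t
toy_canrelease(size_t bufmem, size_t hiwater, size_t wanted)
{
	size_t over = bufmem > hiwater ? bufmem - hiwater : 0;

	return wanted > over ? wanted : over;
}
```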


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is a quite normal condition and 2) there is
a pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
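
The commit message does not show the overflowing expression, so the
following is only an illustration of the class of bug, with hypothetical
names (fullness32, fullness64) and an assumed 0..16 scaling. Multiplying a
byte count by 16 in 32-bit arithmetic wraps once the count reaches 256 MiB,
which can make a cache that is over its high-water mark look nearly empty.

```c
#include <stdint.h>

/* Scale bufmem against hiwater using 32-bit math: the product can wrap. */
static uint32_t
fullness32(uint32_t bufmem, uint32_t hiwater)
{
	uint32_t scaled = (uint32_t)(bufmem * 16u);	/* wraps modulo 2^32 */

	return scaled / hiwater;
}

/* Same computation with the product widened to 64 bits first. */
static uint32_t
fullness64(uint32_t bufmem, uint32_t hiwater)
{
	return (uint32_t)(((uint64_t)bufmem * 16u) / hiwater);
}
```

With bufmem at 512 MiB and hiwater at 256 MiB, the 32-bit version yields 0
while the widened version yields 32.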


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.
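
A minimal sketch of the release calculation described above, assuming
everything is converted to bytes before comparing. The function and
parameter names are hypothetical; this is not the kernel's
buf_canrelease().

```c
#include <stddef.h>

/*
 * Compute a release target in bytes.
 *  - page_deficit is in pages, so convert it before comparing with bytes.
 *  - cap the target at 1/16 of the current buffer memory.
 *  - never return less than the memory sitting on the AGE list.
 */
static size_t
toy_release_target(size_t page_deficit, size_t page_size,
    size_t bufmem, size_t age_bytes)
{
	size_t want = page_deficit * page_size;	/* pages -> bytes */
	size_t cap = bufmem / 16;
	size_t n = want < cap ? want : cap;

	return n > age_bytes ? n : age_bytes;
}
```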

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carosone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
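
The growth control described above can be sketched as follows. This is a
toy model, not the kernel's buf_lotsfree(): toy_may_grow is a hypothetical
name, the 16-step scale is an assumption, and the random draw is passed in
by the caller so that the function stays deterministic.

```c
#include <stdint.h>

/*
 * Decide whether the cache may grow.  Only the low 4 bits of `draw` are
 * used (0..15).  The fuller the cache, the higher the threshold the draw
 * must reach, so the chance of a new allocation shrinks as bufmem grows.
 */
static int
toy_may_grow(uint64_t bufmem, uint64_t hiwater, unsigned draw)
{
	uint64_t thresh;

	if (hiwater == 0 || bufmem >= hiwater)
		return 0;			/* at or above the limit */
	thresh = (bufmem * 16) / hiwater;	/* 0..15 */
	return (draw & 0xf) >= thresh;
}
```

For a threshold t and a uniform draw, the allocation is allowed with
probability (16 - t) / 16.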

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
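
The trick can be sketched with a hand-rolled list that mirrors the tail
queue layout. All names here (toy_buf, toy_queue, toy_remfree) are
hypothetical and this is not the kernel's bremfree(). An element that is
not last can be unlinked using only its own pointers; only the last element
needs a search for the queue whose tail pointer refers to it.

```c
#include <stddef.h>

struct toy_buf {
	struct toy_buf  *next;	/* next element, NULL if last */
	struct toy_buf **prevp;	/* address of the pointer that points at us */
};

struct toy_queue {
	struct toy_buf  *first;
	struct toy_buf **lastp;	/* address of the last element's next field */
};

#define TOY_NQUEUES 3
static struct toy_queue toy_queues[TOY_NQUEUES];

static void
toy_queue_init(struct toy_queue *q)
{
	q->first = NULL;
	q->lastp = &q->first;
}

static void
toy_insert_tail(struct toy_queue *q, struct toy_buf *bp)
{
	bp->next = NULL;
	bp->prevp = q->lastp;
	*q->lastp = bp;
	q->lastp = &bp->next;
}

/* Remove bp from whichever queue it is on, without being told which. */
static void
toy_remfree(struct toy_buf *bp)
{
	struct toy_queue *q;

	if (bp->next != NULL) {
		bp->next->prevp = bp->prevp;
	} else {
		/* Last element: find the queue whose tail refers to us. */
		for (q = toy_queues; q < &toy_queues[TOY_NQUEUES]; q++) {
			if (q->lastp == &bp->next) {
				q->lastp = bp->prevp;
				break;
			}
		}
	}
	*bp->prevp = bp->next;
}
```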


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the list of device major numbers is packed into the kernel
and the LKM framework will refer to it to assign device major numbers
dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
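
The main trap in the mapping above is that the source and destination arguments swap places. A minimal sketch of the conversion follows; the structure and function names are made up for illustration and do not come from vfs_bio.c.

```c
#include <string.h>

/* Hypothetical structure, used only to have something to copy. */
struct hdr_sketch {
	char name[8];
	int  flags;
};

/* was: bzero(h, sizeof(*h)); bcopy(name, h->name, len); */
static void
hdr_sketch_init(struct hdr_sketch *h, const char *name, size_t len)
{
	memset(h, 0, sizeof(*h));
	if (len > sizeof(h->name) - 1)
		len = sizeof(h->name) - 1;	/* keep a terminating NUL */
	memcpy(h->name, name, len);		/* destination comes first */
}

/* was: ovbcopy(buf + 1, buf, n); overlapping regions need memmove() */
static void
shift_left_sketch(char *buf, size_t n)
{
	memmove(buf, buf + 1, n);
}

/* was: bcmp(a, b, sizeof(*a)) == 0 */
static int
hdr_sketch_equal(const struct hdr_sketch *a, const struct hdr_sketch *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```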


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
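
The fallback described above amounts to the following pattern. The types and variable names are simplified stand-ins (assumptions for this sketch), not the kernel's struct proc, curproc, or proc0.

```c
#include <stddef.h>

/* Simplified stand-in for the per-process resource usage counters. */
struct proc_sketch {
	long ru_inblock;
	long ru_oublock;
};

static struct proc_sketch proc0_sketch;		/* stands in for process 0 */
static struct proc_sketch *curproc_sketch;	/* may be NULL, e.g. in a panic */

/* Charge one block read to the current process, or to process 0 if none. */
static void
count_block_read_sketch(void)
{
	struct proc_sketch *p;

	p = (curproc_sketch != NULL) ? curproc_sketch : &proc0_sketch;
	p->ru_inblock++;
}
```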


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC - > DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.269 20-Jan-2017 skrll

Fix build


# 1.268 20-Jan-2017 skrll

Simplify getiobuf. buf_init already does bp->b_objlock == &buffer_lock


Revision tags: bouyer-socketcan-base pgoyette-localcount-20170107
# 1.267 28-Dec-2016 pgoyette

Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
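
As a rough illustration of the reorganisation (not the code from this revision): a test of the form "guess / 16 >= used / range" can be cross-multiplied so that no division is needed. The function name, parameters, and the choice of 16 buckets are assumptions made for this sketch; the random value is passed in rather than drawn from cprng_fast32().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether to allow a new allocation.
 *   rnd   - a 32-bit random value supplied by the caller
 *   used  - amount currently used above the low water mark
 *   range - distance between the low and high water marks
 * The chance of returning true shrinks as used approaches range.
 * Both sides are computed in 64 bits so the products cannot overflow
 * for 32-bit inputs.
 */
static bool
allow_alloc_sketch(uint32_t rnd, uint32_t used, uint32_t range)
{
	uint64_t guess = rnd % 16;	/* bucket in [0, 16) */

	/* Equivalent to guess / 16 >= used / range, without dividing. */
	return (uint64_t)range * guess >= (uint64_t)used * 16;
}
```

Besides avoiding the division, the multiplied form has no divide-by-zero case when range is small.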


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-nhusb-base-20170116 netbsd-7-1-RC1 netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set, and on its call to biodone(mbp) with mbp->b_resid at zero, physio()
panics, since it asserts that if an error is set on a buffer there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
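
The completion logic described above can be sketched as follows. The structure is a simplified stand-in for the master buffer and the function name is hypothetical; the `done` field takes the place of the real biodone(mbp) call.

```c
#include <stddef.h>

/* Simplified stand-in for the master buffer of a nested i/o. */
struct master_sketch {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int    b_error;		/* first error seen, 0 if none */
	int    done;		/* set where biodone(mbp) would be called */
};

/* Called once per nested buffer as it completes. */
static void
nestio_done_sketch(struct master_sketch *mbp, size_t donebytes, int error)
{
	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;	/* one failure fails the master */

	mbp->b_resid -= donebytes;
	if (mbp->b_resid == 0) {
		/* Last nested buffer: an error means nothing was transferred. */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

With this ordering, a consumer that asserts "error implies nonzero residual" sees a consistent master buffer in both the success and the failure case.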


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.
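
A sketch of how a buffer size might be mapped to a power-of-two pool once
512 bytes is the smallest class. This is an illustration under assumptions:
`sketch_poolidx`, `sketch_poolsize` and `SKETCH_MIN_SHIFT` are hypothetical
names, and the real lookup in vfs_bio.c may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MIN_SHIFT 9	/* smallest pool holds 1 << 9 = 512 bytes */

/* Index of the smallest power-of-two pool that can hold 'size' bytes. */
static unsigned
sketch_poolidx(size_t size)
{
	unsigned n = 0;

	assert(size > 0);
	size = (size - 1) >> SKETCH_MIN_SHIFT;
	while (size != 0) {
		size >>= 1;
		n++;
	}
	return n;
}

/* Item size of the pool at a given index. */
static size_t
sketch_poolsize(unsigned idx)
{
	return (size_t)1 << (idx + SKETCH_MIN_SHIFT);
}
```

Under this mapping a 512-byte fragment occupies a 512-byte item instead of
a 1k one, which is the halving of waste the commit message refers to.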


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
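
The adjustment can be pictured with the sketch below. It is an assumption
about the shape of the logic, not the committed code; `sketch_canrelease`
and its parameters are hypothetical, and all quantities are taken to be in
bytes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many bytes the cache should give back.  'wanted' is whatever the
 * ordinary policy would release; it cannot exceed what the cache holds.
 */
static uint64_t
sketch_canrelease(uint64_t bufmem, uint64_t hiwater, uint64_t wanted)
{
	uint64_t excess;

	assert(wanted <= bufmem);
	if (bufmem <= hiwater)
		return wanted;

	/* Above the high-water mark: release at least the whole excess. */
	excess = bufmem - hiwater;
	return wanted > excess ? wanted : excess;
}
```

Once recycled buffers have grown the cache past the mark, the excess is
returned in one step rather than lingering above the limit.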


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing for a low memory condition to decide whether to allocate doesn't
make sense, since 1) the condition is quite normal and 2) there is a
pool between us and uvm.
- Make the step in allocation probability a bit smoother by moving the origin
of the curve from 0 to the lowater mark.
Simon reported in PR#27057 that this helps interactive performance when
there is heavy disk activity.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.
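
The log does not show the offending expression, so the following is only
an assumed illustration of how a 32-bit intermediate can wrap and make a
cache that is over the high-water mark look as if it were under it. The
function names are hypothetical, and the narrow version assumes a 32-bit
`unsigned int`.

```c
#include <assert.h>
#include <stdint.h>

/* Overflow-prone: 16 * bufmem wraps once bufmem reaches 256 MiB. */
static uint32_t
sketch_thresh_narrow(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater != 0);
	return (16 * bufmem) / hiwater;
}

/* Widen before multiplying so the intermediate product cannot wrap. */
static uint64_t
sketch_thresh_wide(uint32_t bufmem, uint32_t hiwater)
{
	assert(hiwater != 0);
	return ((uint64_t)16 * bufmem) / hiwater;
}
```

With 384 MiB in use against a 256 MiB mark, the narrow form yields a value
below 16 (apparently under the mark) while the widened form yields 24.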


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?
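
A sketch of the release target described in 1.115, with the unit
conversion made explicit. The names are hypothetical, and working in bytes
is an assumption; the committed code may count in other units.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes the buffer cache should offer back to UVM.
 *   page_deficit: UVM's shortfall, in pages
 *   page_size:    bytes per page
 *   bufmem:       bytes currently held by the buffer cache
 *   age_bytes:    bytes held by buffers on the AGE list
 */
static uint64_t
sketch_release_target(uint64_t page_deficit, uint64_t page_size,
    uint64_t bufmem, uint64_t age_bytes)
{
	uint64_t want, cap;

	assert(page_size != 0);
	want = page_deficit * page_size;	/* pages -> bytes */
	cap = bufmem / 16;			/* be conservative */
	if (want > cap)
		want = cap;
	if (want < age_bytes)			/* AGE list always goes */
		want = age_bytes;
	return want;
}
```

Comparing `page_deficit` directly against a byte count, without the
multiplication, is the kind of unit mismatch the commit message describes.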


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.
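
The growth control in 1.114 can be sketched as follows. This is a
hypothetical reconstruction from the description only: `sketch_lotsfree`
is an invented name, the 16-step scale is an assumption, and the random
value is passed in by the caller so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the cache may grow by allocating new memory.
 * 'r' is a caller-supplied random value in the range 0..15.
 */
static bool
sketch_lotsfree(uint64_t bufmem, uint64_t hiwater, unsigned r)
{
	uint64_t thresh;

	assert(hiwater != 0);
	assert(r < 16);

	if (bufmem >= hiwater)
		return false;		/* at the limit: recycle instead */

	/* Scale current usage to 0..15; fuller cache, higher threshold. */
	thresh = bufmem * 16 / hiwater;
	return r >= thresh;
}
```

An empty cache always grows, a half-full one grows for about half of the
random draws, and one just below the mark grows for only one draw in 16.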


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store a i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.
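
The trick the comment refers to can be sketched with the `<sys/queue.h>`
TAILQ macros as shipped on BSD systems and glibc. The names
`sketch_buf`, `sketch_queues` and `sketch_remfree` are hypothetical; this
shows the idea, not the code in bremfree(). A TAILQ element can be
unlinked without its head unless it is the last element, in which case the
head is found by looking for the queue whose tail pointer refers to it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct sketch_buf {
	TAILQ_ENTRY(sketch_buf) b_freelist;
};
TAILQ_HEAD(sketch_bqueue, sketch_buf);

#define SKETCH_NQUEUES 3
static struct sketch_bqueue sketch_queues[SKETCH_NQUEUES];

static void
sketch_init(void)
{
	size_t i;

	for (i = 0; i < SKETCH_NQUEUES; i++)
		TAILQ_INIT(&sketch_queues[i]);
}

/* Remove bp from whichever free list it is on. */
static void
sketch_remfree(struct sketch_buf *bp)
{
	struct sketch_bqueue *dp = NULL;
	size_t i;

	if (bp->b_freelist.tqe_next == NULL) {
		/*
		 * Last element: the head's tail pointer must be fixed up,
		 * so find the head whose tqh_last points at our next field.
		 */
		for (i = 0; i < SKETCH_NQUEUES; i++) {
			if (sketch_queues[i].tqh_last ==
			    &bp->b_freelist.tqe_next) {
				dp = &sketch_queues[i];
				break;
			}
		}
		assert(dp != NULL);
		TAILQ_REMOVE(dp, bp, b_freelist);
	} else {
		/* Not last: unlink by hand without touching any head. */
		bp->b_freelist.tqe_next->b_freelist.tqe_prev =
		    bp->b_freelist.tqe_prev;
		*bp->b_freelist.tqe_prev = bp->b_freelist.tqe_next;
	}
}
```

The search over all heads only happens for the tail element, so the common
case stays a constant-time unlink.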


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require that bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- At compile time, the device major number list is packed into the kernel and
the LKM framework will refer to it to assign device major numbers dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
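
The mapping above swaps the first two arguments for the copy routines, which is easy to get wrong by hand. A minimal sketch of the conversion follows; the helper names are hypothetical and do not come from vfs_bio.c.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helpers, for illustration only.
 * bcopy(src, dst, len) becomes memcpy(dst, src, len): source and
 * destination trade places.
 */
void
copy_block(const void *src, void *dst, size_t len)
{
	assert(len == 0 || (src != NULL && dst != NULL));
	memcpy(dst, src, len);		/* was: bcopy(src, dst, len) */
}

/* ovbcopy(src, dst, len) becomes memmove(dst, src, len); overlap is allowed. */
void
copy_overlapping(const void *src, void *dst, size_t len)
{
	assert(len == 0 || (src != NULL && dst != NULL));
	memmove(dst, src, len);		/* was: ovbcopy(src, dst, len) */
}

/* bzero(buf, len) becomes memset(buf, 0, len). */
void
clear_block(void *buf, size_t len)
{
	assert(len == 0 || buf != NULL);
	memset(buf, 0, len);		/* was: bzero(buf, len) */
}
```

bcmp() keeps its argument order when it becomes memcmp(), so only the copy and zeroing calls need their arguments rearranged.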


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
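
The fallback described above can be sketched roughly as follows. The structure and globals are simplified stand-ins declared here for illustration (the real kernel reaches the rusage counters through more levels of indirection), so treat this as an assumption-laden model rather than the committed code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down stand-in for the kernel's proc structure. */
struct proc {
	long ru_inblock;	/* block input operations charged */
	long ru_oublock;	/* block output operations charged */
};

struct proc proc0;		/* stand-in for process 0 */
struct proc *curproc = NULL;	/* may be NULL, e.g. while handling a panic */

/* Pick the process to charge: curproc if set, otherwise process 0. */
struct proc *
io_accounting_proc(void)
{
	return (curproc != NULL) ? curproc : &proc0;
}

/* Charge one block read through a local, never through curproc directly. */
void
charge_block_read(void)
{
	struct proc *p = io_accounting_proc();

	assert(p != NULL);
	p->ru_inblock++;
}
```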


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor type pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite out, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.


# 1.267 28-Dec-2016 pgoyette

Remove some extraneous whitespace


# 1.266 27-Dec-2016 pgoyette

Decouple BIOHIST from other users of KERNHIST.


# 1.265 26-Dec-2016 pgoyette

Fix locking so we don't release the lock between the time we check the
tailq (for being non-empty) and the time we remove an entry.


# 1.264 26-Dec-2016 pgoyette

Add a BIOHIST option. As mentioned on tech-kern.


# 1.263 18-Dec-2016 dholland

typo in comment


Revision tags: nick-nhusb-base-20161204 pgoyette-localcount-20161104
# 1.262 28-Oct-2016 jdolecek

fixup comment


Revision tags: nick-nhusb-base-20161004
# 1.261 29-Sep-2016 christos

Allow sparc kernels to build with SSP by using a constant PAGE_SIZE...


Revision tags: localcount-20160914 pgoyette-localcount-20160806
# 1.260 31-Jul-2016 dholland

In bwrite, add assertion that vp != NULL. (vp is the vnode from the
buffer being written.)

There's some logic here that carefully checks for vp being null, and
other logic that will crash if it is. It appears that it's all
needless paranoia. See tech-kern for more info.

Unless someone sees the assertion go off (in which case a lot more
investigation is needed) I or someone will clean out the logic at some
future point.

Spotted by coypu.


Revision tags: pgoyette-localcount-20160726 pgoyette-localcount-base nick-nhusb-base-20160907 nick-nhusb-base-20160529 nick-nhusb-base-20160422 nick-nhusb-base-20160319
# 1.259 01-Feb-2016 riz

branches: 1.259.2;
Implement the 'io' provider for DTrace. From riastradh@, with
fixes from me.


# 1.258 11-Jan-2016 dholland

Whatever the point of this "biodone_vfs" global function pointer is
(something rumpity?) declare it properly in a header file instead of
in secret where its types can diverge.


# 1.257 01-Jan-2016 martin

KASSERT->KASSERTMSG to allow debugging a double-free'd buffer in ddb.


Revision tags: nick-nhusb-base-20151226 nick-nhusb-base-20150921
# 1.256 24-Aug-2015 pooka

to garnish, dust with _KERNEL_OPT


Revision tags: nick-nhusb-base-20150606 nick-nhusb-base-20150406
# 1.255 28-Mar-2015 maxv

Remove the 'cred' argument from bread(). Remove a now unused var in
ffs_snapshot.c. Update the man page accordingly.

ok hannken@


# 1.254 28-Mar-2015 maxv

Remove the 'cred' argument from breadn(), and update the man page
accordingly.

ok hannken@


# 1.253 28-Mar-2015 maxv

Remove the 'cred' argument from bio_doread().


Revision tags: nick-nhusb-base
# 1.252 08-Sep-2014 joerg

branches: 1.252.2;
Replace random with cprng_fast32. Reorganise computation to replace
(32bit) division with (long) multiplication.
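
One common way to trade a 32-bit division for a wider multiplication is to scale a 32-bit random value into a range with a multiply and shift. The sketch below illustrates that general technique only; it is an assumption that the commit does something of this shape, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Division-based reduction: maps r into [0, n) using a 32-bit modulo. */
uint32_t
scale_by_modulo(uint32_t r, uint32_t n)
{
	assert(n != 0);
	return r % n;
}

/*
 * Multiplication-based reduction: treats r as a fraction r / 2^32 and
 * multiplies it by n using a 64-bit product, then keeps the high word.
 * The result is also in [0, n), with no division involved.
 */
uint32_t
scale_by_multiply(uint32_t r, uint32_t n)
{
	assert(n != 0);
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

The two functions do not return the same value for a given input; both simply land in the requested range, which is all a random selection needs.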


# 1.251 05-Sep-2014 matt

Don't nest structure and enum definitions.
Don't use C++ keywords new, try, class, private, etc.


Revision tags: netbsd-7-0-2-RELEASE netbsd-7-nhusb-base netbsd-7-0-1-RELEASE netbsd-7-0-RELEASE netbsd-7-0-RC3 netbsd-7-0-RC2 netbsd-7-0-RC1 netbsd-7-base tls-earlyentropy-base tls-maxphys-base
# 1.250 25-May-2014 pooka

Call biodone() in the bdev_strategy() error via a pointer. Decouples
subr_devsw from VFS -- not that I/O buffers are _VFS_ entities -- and
eliminates the last weak alias from librump, which means things now
fully work on glibc (w/o LD_DYNAMIC_WEAK) and musl.

The whole code path is suspect anyway, since nothing prevents the device
from escaping after the lookup, suggesting that the whole error path
should be handled by the caller, but oh well.


Revision tags: yamt-pagecache-base9 riastradh-xf86-video-intel-2-7-1-pre-2-21-15 riastradh-drm2-base3 rmind-smpnet-nbase rmind-smpnet-base
# 1.249 25-Feb-2014 pooka

branches: 1.249.2;
Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.

Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.


# 1.248 25-Oct-2013 martin

Mark a diagnostic-only variable


# 1.247 30-Sep-2013 hannken

Replace macro v_specmountpoint with two functions spec_node_getmountedfs()
and spec_node_setmountedfs() to manage the file system mounted on a device.
Assert the device is a block device.

Welcome to 6.99.24

Discussed on tech-kern@ some time ago.

Reviewed by: David Holland <dholland@netbsd.org>


# 1.246 15-Sep-2013 martin

Remove unused variable


Revision tags: riastradh-drm2-base2 riastradh-drm2-base1 riastradh-drm2-base
# 1.245 28-Jun-2013 christos

branches: 1.245.2;
remove useless initialization
http://m00nbsd.net/ae123a9bae03f7dde5c6d654412daf5a.html


# 1.244 19-May-2013 njoly

Redo case where buf_map is not yet mapped when buf_memcalc() is called
early from pmap_bootstrap(). Makes alpha, and probably others, boot
again.


Revision tags: agc-symver-base
# 1.243 18-Mar-2013 para

calculate vnode cache size based on the resource it gets allocated from
this stops setting kern.maxvnodes too high so it exhausts available space in kmem

http://mail-index.netbsd.org/tech-kern/2013/03/08/msg015095.html


Revision tags: yamt-pagecache-base8
# 1.242 30-Dec-2012 hannken

Always call brelse() on error for breadn() too.


# 1.241 29-Dec-2012 christos

Always call brelse() on error. Otherwise a possible error from bread() will
cause the buffer to stay locked and we end up blocking forever in
VOP_CLOSE->spec_close->vinvalbuf->bbusy since the buffer is marked busy
but there is no I/O pending.
This caused my laptop to hang on boot_findwedge because:
findroot: unable to read block 358331527 of dev dk0 (22)


# 1.240 20-Dec-2012 hannken

Change bread() and breadn() to never return a buffer on
error and modify all callers to not brelse() on error.

Welcome to 6.99.16

PR kern/46282 (6.0_BETA crash: msdosfs_bmap -> pcbmap -> bread -> bio_doread)


Revision tags: yamt-pagecache-base7 yamt-pagecache-base6
# 1.239 03-Jun-2012 dsl

branches: 1.239.2;
Use separate temporaries for the 'int' percentage and the 'long'
water marks.
Previously panicked on sparc64 due to a misaligned copy.


# 1.238 03-Jun-2012 dsl

Fix processing of vm.bufmem_lowater and vm.bufmem_hiwater on 64bit systems.
The values are 'u_long' so copying them into an 'int' temporary
(to avoid writing an out of range value into the actual variable)
doesn't work too well at all.
Shows up on amd64 now that the sysctl values are marked as 64bit.
sparc64 must have been badly broken for ages.
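
The underlying hazard is that an int temporary cannot hold every u_long value on a 64-bit system. The sketch below models only that narrowing problem with hypothetical helper names; it does not reproduce the sysctl handler itself.

```c
#include <assert.h>
#include <limits.h>

/* True if the value cannot be represented in an int temporary. */
int
is_lossy(unsigned long value)
{
	return value > (unsigned long)INT_MAX;
}

/*
 * Buggy pattern: stage an unsigned long setting in an int temporary.
 * For out-of-range values the conversion is implementation-defined and
 * typically drops the high bits where long is wider than int.
 */
unsigned long
stage_through_int(unsigned long value)
{
	int tmp = (int)value;

	return (unsigned long)tmp;
}

/* Fixed pattern: the temporary has the same type as the variable. */
unsigned long
stage_through_ulong(unsigned long value)
{
	unsigned long tmp = value;

	assert(tmp == value);	/* nothing can be lost */
	return tmp;
}
```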


# 1.237 02-Jun-2012 dsl

Add some pre-processor magic to verify that the type of the data item
passed to sysctl_createv() actually matches the declared type for
the item itself.
In the places where the caller specifies a function and a structure
address (typically the 'softc') an explicit (void *) cast is now needed.
Fixes bugs in sys/dev/acpi/asus_acpi.c sys/dev/bluetooth/bcsp.c
sys/kern/vfs_bio.c sys/miscfs/syncfs/sync_subr.c and setting
AcpiGbl_EnableAmlDebugObject.
(mostly passing the address of a uint64_t when typed as CTLTYPE_INT).
I've test built quite a few kernels, but there may be some unfixed MD
fallout. Most likely passing &char[] to char *.
Also add CTLFLAG_UNSIGNED for unsigned decimals - not set yet.


Revision tags: jmcneill-usbmp-base10 yamt-pagecache-base5 jmcneill-usbmp-base9 yamt-pagecache-base4 jmcneill-usbmp-base8 jmcneill-usbmp-base7 jmcneill-usbmp-base6 jmcneill-usbmp-base5 jmcneill-usbmp-base4 jmcneill-usbmp-base3 jmcneill-usbmp-base2 netbsd-6-base
# 1.236 01-Feb-2012 para

branches: 1.236.2;
allocate uareas and buffers from kernel_map again
add code to drain pools if kmem_arena runs out of space


# 1.235 28-Jan-2012 rmind

pool_page_alloc, pool_page_alloc_meta: avoid extra compare, use const.
ffs_mountfs,sys_swapctl: replace memset with kmem_zalloc.
sys_swapctl: move kmem_free outside the lock path.
uvm_init: fix comment, remove pointless numeration of steps.
uvm_map_enter: remove meflagval variable.
Fix some indentation.


# 1.234 27-Jan-2012 para

extending vmem(9) to be able to allocate resources for its own needs.
simplifying uvm_map handling (no special kernel entries anymore no relocking)
make malloc(9) a thin wrapper around kmem(9)
(with private interface for interrupt safety reasons)

releng@ acknowledged


# 1.233 26-Jan-2012 rmind

sysctl_dobuf: re-acquire the sysctl lock on retry path. PR/45827.


Revision tags: jmcneill-usbmp-pre-base2 jmcneill-usbmp-base jmcneill-audiomp3-base yamt-pagecache-base3 yamt-pagecache-base2 yamt-pagecache-base
# 1.232 05-Oct-2011 jakllsch

branches: 1.232.2; 1.232.6;
Make parts of this somewhat less gross by using ilog2() from
<sys/bitops.h>. Additionally, concoct shorter wchan names for >=1 MiB
buf pools using the irrelevantly-more-correct format specifier for u_int.


# 1.231 11-Jul-2011 hannken

Change VOP_BWRITE() to take a vnode as its first argument like all other
VOPs do. Layered file systems no longer have to modify bp->b_vp and run
into trouble when an async VOP_BWRITE() uses the wrong vnode.

- change all occurrences of VOP_BWRITE(bp) to VOP_BWRITE(bp->b_vp, bp).
- remove layer_bwrite().
- welcome to 5.99.55

Addresses PR kern/38762 panic: vwakeup: neg numoutput

No objections from tech-kern@.


# 1.230 12-Jun-2011 rmind

Welcome to 5.99.53! Merge rmind-uvmplock branch:

- Reorganize locking in UVM and provide extra serialisation for pmap(9).
New lock order: [vmpage-owner-lock] -> pmap-lock.

- Simplify locking in some pmap(9) modules by removing P->V locking.

- Use lock object on vmobjlock (and thus vnode_t::v_interlock) to share
the locks amongst UVM objects where necessary (tmpfs, layerfs, unionfs).

- Rewrite and optimise x86 TLB shootdown code, make it simpler and cleaner.
Add TLBSTATS option for x86 to collect statistics about TLB shootdowns.

- Unify /dev/mem et al in MI code and provide required locking (removes
kernel-lock on some ports). Also, avoid cache-aliasing issues.

Thanks to Andrew Doran and Joerg Sonnenberger, as their initial patches
formed the core changes of this branch.


Revision tags: rmind-uvmplock-nbase cherry-xenmp-base rmind-uvmplock-base
# 1.229 24-Apr-2011 rmind

branches: 1.229.2;
- Replace few malloc(9) uses with kmem(9).
- Rename buf_malloc() to buf_alloc(), fix comments.
- Remove some unnecessary inclusions.


# 1.228 23-Mar-2011 rmind

G/C count_lock_queue (unused for 12 years)


Revision tags: bouyer-quota2-nbase bouyer-quota2-base
# 1.227 17-Jan-2011 uebayasi

Include internal definitions (uvm/uvm.h) only where necessary.


Revision tags: jruoho-x86intr-base matt-mips64-premerge-20101231
# 1.226 22-Dec-2010 reinoud

branches: 1.226.2;
Fix nestio's behavior on error.

The mbp->b_resid is used to track if all the nested buffers have been issued
and reported back. When the last buffer calls in, mbp->b_resid becomes zero
and biodone(mbp) is called. This is fine as long as there are no errors.

If a read error does occur in one of the nested buffers, mbp->b_error is
set and on its call to biodone(mbp), with mbp->b_resid being zero, physio()
panics since it asserts that IF an error is set on a buffer, there should be a
residual amount of data left to be transferred.

The patch fixes this case by setting mbp->b_resid back to mbp->b_bcount on
mbp->b_error just before biodone(mbp).

This behaviour is consistent with normal buffer issuing. It either succeeds
or doesn't succeed.
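
A rough model of the completion accounting described above, assuming a pared-down master buffer. The structure and function names are hypothetical, and the `done` flag stands in for the call to biodone(mbp).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified master buffer; the real struct buf has far more fields. */
struct master_buf {
	size_t b_bcount;	/* total bytes requested */
	size_t b_resid;		/* bytes not yet reported back */
	int    b_error;		/* error recorded from a nested buffer, 0 if none */
	int    done;		/* set where biodone(mbp) would be called */
};

/* Account for one nested buffer reporting back `donebytes` with `error`. */
void
nested_done(struct master_buf *mbp, size_t donebytes, int error)
{
	assert(donebytes <= mbp->b_resid);

	if (error != 0 && mbp->b_error == 0)
		mbp->b_error = error;	/* one failing child fails the master */
	mbp->b_resid -= donebytes;

	if (mbp->b_resid == 0) {
		/*
		 * All children have reported.  On error, restore the full
		 * residual so "error set" always implies "data left over".
		 */
		if (mbp->b_error != 0)
			mbp->b_resid = mbp->b_bcount;
		mbp->done = 1;
	}
}
```

With this shape a consumer that asserts "an error implies a nonzero residual", as physio() does per the message above, sees a consistent master buffer in both the success and the failure case.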


# 1.225 12-Dec-2010 hannken

brelsel: Clear B_COWDONE flag on clean (! BO_DELWRI) buffer. B_COWDONE is set
if the buffer was read with intention to modify but the caller changed its mind.

This error could lead to snapshot corruption when a buffer with B_COWDONE set
resides on the freelist and we create a new snapshot.


Revision tags: uebayasi-xip-base4
# 1.224 02-Nov-2010 pooka

Don't sleep forever if hz < 25.

from Alessandro Forin


Revision tags: uebayasi-xip-base3 yamt-nfs-mp-base11 uebayasi-xip-base2 yamt-nfs-mp-base10 uebayasi-xip-base1 yamt-nfs-mp-base9
# 1.223 02-Mar-2010 pooka

branches: 1.223.2;
fs_ffs.h is no longer required (since the death of bufops / softdep)


Revision tags: uebayasi-xip-base matt-premerge-20091211
# 1.222 17-Nov-2009 pooka

branches: 1.222.2;
Add a comment saying "name" to pool_init() is never freed (fixing
requires touching pool implementation). No biggie, though, since
the pools themselves are never freed.


# 1.221 11-Nov-2009 rmind

Add a small comment on buffer cache locking, fix mark letter b_objlock.


# 1.220 11-Nov-2009 rmind

G/C unused breada() and bdirty().


# 1.219 05-Nov-2009 pooka

Excommunicate comment not abiding to the 80col dogma.
(well, turns out it was no longer valid either)


Revision tags: yamt-nfs-mp-base8 yamt-nfs-mp-base7 jymxensuspend-base yamt-nfs-mp-base6 yamt-nfs-mp-base5 yamt-nfs-mp-base4 jym-xensuspend-nbase yamt-nfs-mp-base3 nick-hppapmap-base4 nick-hppapmap-base3 jym-xensuspend-base nick-hppapmap-base
# 1.218 11-Mar-2009 mrg

like KERN_FILE2: *do* update "needed" when there is no count. we want
userland to know what sort of size to provide..

while here, slightly normalise the previous to init_sysctl.c.


Revision tags: nick-hppapmap-base2
# 1.217 23-Feb-2009 ad

Fix some comments.


# 1.216 22-Feb-2009 ad

PR kern/26878 FFSv2 + softdep = livelock (no free ram)
PR kern/16942 panic with softdep and quotas
PR kern/19565 panic: softdep_write_inodeblock: indirect pointer #1 mismatch
PR kern/26274 softdep panic: allocdirect_merge: ...
PR kern/26374 Long delay before non-root users can write to softdep partitions
PR kern/28621 1.6.x "vp != NULL" panic in ffs_softdep.c:4653 while unmounting a softdep (+quota) filesystem
PR kern/29513 FFS+Softdep panic with unfsck-able file-corruption
PR kern/31544 The ffs softdep code appears to fail to write dirty bits to disk
PR kern/31981 stopping scsi disk can cause panic (softdep)
PR kern/32116 kernel panic in softdep (assertion failure)
PR kern/32532 softdep_trackbufs deadlock
PR kern/37191 softdep: locking against myself
PR kern/40474 Kernel panic after remounting raid root with softdep

Retire softdep, pass 2. As discussed and later formally announced on the
mailing lists.


Revision tags: haad-dm-base2 haad-nbase2 haad-dm-base mjf-devfs2-base
# 1.215 07-Dec-2008 pooka

branches: 1.215.2;
Move some sysctl node creations away from linksets and into the
constructors for subsystems.

XXX: CTLFLAG_PERMANENT is non-sensible.


Revision tags: ad-audiomp2-base
# 1.214 16-Nov-2008 joerg

Backout revision 1.212 and add a comment that short-cutting the WAPBL
case is not possible. The buffer length has changed and the rounded size
may not have, essentially changing the transaction size. Reported by
various users and in PR 39898.


# 1.213 11-Nov-2008 joerg

Move WAPBL replay handling from bread() into ufs_strategy.
This changes the order of hook processing as the copy-on-write handlers
are called after the journal processing. This makes more sense as the
journal overwrite is logically part of the disk IO.


# 1.212 10-Nov-2008 joerg

If the size of the buffer didn't change, don't bother updating the WAPBL
accounting as it won't change either.


# 1.211 04-Nov-2008 reinoud

Don't dereference bp->b_vp->v_mount if its vnode type is VT_VNON. I don't
know if this masks a bug but with a machine having a ffs+wapbl mount, NFS
mounts and a ntfs mount this panicked the machine on suspend.


Revision tags: netbsd-5-1-2-RELEASE netbsd-5-1-1-RELEASE matt-nb5-mips64-premerge-20101231 matt-nb5-pq3-base netbsd-5-1-RELEASE netbsd-5-1-RC4 matt-nb5-mips64-k15 netbsd-5-1-RC3 netbsd-5-1-RC2 netbsd-5-1-RC1 netbsd-5-0-2-RELEASE matt-nb5-mips64-premerge-20091211 matt-nb5-mips64-u2-k2-k4-k7-k8-k9 matt-nb4-mips64-k7-u2a-k9b matt-nb5-mips64-u1-k1-k5 netbsd-5-0-1-RELEASE netbsd-5-0-RELEASE netbsd-5-0-RC4 netbsd-5-0-RC3 netbsd-5-0-RC2 netbsd-5-0-RC1 netbsd-5-base matt-mips64-base2 haad-dm-base1 wrstuden-revivesa-base-4 wrstuden-revivesa-base-3
# 1.210 11-Sep-2008 hannken

branches: 1.210.2; 1.210.4; 1.210.6; 1.210.10;
nestiobuf_setup(): Initialize b_dev from master buffer.


Revision tags: wrstuden-revivesa-base-2
# 1.209 30-Aug-2008 reinoud

Accidental commit, but asserts buffer cache lock held


# 1.208 31-Jul-2008 simonb

Merge the simonb-wapbl branch. From the original branch commit:

Add Wasabi System's WAPBL (Write Ahead Physical Block Logging)
journaling code. Originally written by Darrin B. Jewell while
at Wasabi and updated to -current by Antti Kantee, Andy Doran,
Greg Oster and Simon Burge.

OK'd by core@, releng@.


Revision tags: simonb-wapbl-nbase simonb-wapbl-base
# 1.207 14-Jul-2008 hannken

bdwrite(): The COWDONE check may fire for special vnodes with mounted
non-ffs file system. Remove VBLK vnodes from the check.

Should fix PR kern/38892


# 1.206 06-Jul-2008 bouyer

branches: 1.206.2;
kern/39052: fix broken assertion. We can have a BC_BUSY buffer in the LRU
queue, if it's being flushed. But in this case, we'll also have B_VFLUSH.

While there fix checkfreelist() so that it can be used to check that a
buffer is not in the free lists.


Revision tags: wrstuden-revivesa-base-1 wrstuden-revivesa-base
# 1.205 17-Jun-2008 mlelstv

Drop !cv_has_waiters assertion.

bdirty() is called from within biodone() processing before
waiters have been woken up and removed.

N.B. it is also used by smbfs.


# 1.204 17-Jun-2008 reinoud

Mark a buffer `busy` in getnewbuf() when it came from the pool_cache since
it's not on a free list.

Also change buf_init() to not automatically mark buffers `busy' since this
only makes sense for bufcache buffers.

Mark all buf_init'd buffers 'busy' on the places where they ought to be
flagged as such to not confuse the buffer cache.

Fixes PR 38923.


Revision tags: yamt-pf42-base4
# 1.203 16-Jun-2008 ad

Sprinkle more assertions.


# 1.202 06-Jun-2008 ad

branches: 1.202.2;
- Make getiobuf() return buffers marked BUSY.
- Sprinkle more assertions.


Revision tags: yamt-pf42-base3
# 1.201 31-May-2008 ad

biodone2: if the buffer is async or has a callback method, assert that
there are no waiters on b_done (threads in biowait()).


# 1.200 31-May-2008 ad

tsleep -> kpause


# 1.199 26-May-2008 ad

brelse: always wakeup on b_busy, in case BC_WANTED is not set.


Revision tags: hpcarm-cleanup-nbase yamt-pf42-base2
# 1.198 16-May-2008 hannken

Make sure all cached buffers with valid, not yet written data have been
run through copy-on-write. Call fscow_run() with valid data where possible.

The LP_UFSCOW hack is no longer needed to protect ffs_copyonwrite() against
endless recursion.

- Add a flag B_MODIFY to bread(), breada() and breadn(). If set the caller
intends to modify the buffer returned.

- Always run copy-on-write on buffers returned from ffs_balloc().

- Add new function ffs_getblk() that gets a buffer, assigns a new blkno,
may clear the buffer and runs copy-on-write. Process possible errors
from getblk() or fscow_run(). Part of PR kern/38664.

Welcome to 4.99.63

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: yamt-nfs-mp-base2
# 1.197 05-May-2008 ad

branches: 1.197.2;
- Convert hashinit() to use kmem_alloc(). The hash tables can be large
and it's better to not have them in kmem_map.
- Convert a couple of minor items along the way to kmem_alloc().
- Fix some memory leaks.


# 1.196 28-Apr-2008 martin

Remove clause 3 and 4 from TNF licenses


Revision tags: yamt-nfs-mp-base
# 1.195 22-Apr-2008 reinoud

branches: 1.195.2;
When using nested buffers, allow one erroring-out nested buffer to
error-out the master buffer.

The old setup was nondeterministic since a later scheduled nested buffer
could clear the error again since there is no B_ERROR flag anymore. It also
would discard the error the nested buffer returned.


Revision tags: yamt-pf42-baseX yamt-pf42-base
# 1.194 27-Mar-2008 ad

branches: 1.194.2;
Make rusage collection per-LWP and collate in the appropriate places.
cloned threads need a little bit more work but the locking needs to
be fixed first.


Revision tags: ad-socklock-base1
# 1.193 25-Mar-2008 yamt

- for some ports, especially for ones without pmap_growkernel,
buf_memcalc is used by bootstrap as well. fix NULL dereference for them.
- limit kva usage for each cache to 20% of vm_map. XXX a bit arbitrary.
- add a comment.


Revision tags: yamt-lazymbuf-base15 yamt-lazymbuf-base14
# 1.192 23-Mar-2008 yamt

when calculating some cache sizes, consider the amount of available kva.
PR/33185.


# 1.191 23-Mar-2008 yamt

make buf_map static.


Revision tags: keiichi-mipv6-nbase keiichi-mipv6-base matt-armv6-nbase
# 1.190 29-Feb-2008 yamt

update a comment.


Revision tags: nick-net80211-sync-base hpcarm-cleanup-base
# 1.189 20-Feb-2008 matt

branches: 1.189.2; 1.189.6;
Merge all the *different* definitions of bufqueues into one common one.


Revision tags: mjf-devfs-base
# 1.188 15-Feb-2008 ad

Give bbusy() an interlock argument. If we need to wait for the buffer,
the interlock is dropped and reacquired when awoken. This allows for
busying buffers attached to a list that is not locked by bufcache_lock.


# 1.187 15-Feb-2008 ad

The buffer LOCKED flag need not be under the protection of bufcache_lock,
BUSY is enough.


# 1.186 02-Feb-2008 hannken

BO_COWDONE -> B_COWDONE: this flag is tested/modified from the thread owning
the buffer and therefore needs no protection by a mutex.

Ok: Andrew Doran <ad@netbsd.org>


Revision tags: bouyer-xeni386-nbase bouyer-xeni386-base matt-armv6-base
# 1.185 07-Jan-2008 ad

Patch up sysctl locking:

- Lock processes, credentials, filehead etc correctly.
- Acquire a read hold on sysctl_treelock if only doing a query.
- Don't wire down the output buffer. It doesn't work correctly and the code
regularly does long term sleeps with it held - it's not worth it.
- Don't hold locks other than sysctl_lock while doing copyout().
- Drop sysctl_lock while doing copyout / allocating memory in a few places.
- Don't take kernel_lock for sysctl.
- Fix a number of bugs spotted along the way


# 1.184 07-Jan-2008 ad

bwrite, bdwrite: bufcache_lock must be held for reassignbuf.


# 1.183 02-Jan-2008 ad

Merge vmlocking2 to head.


Revision tags: vmlocking2-base3
# 1.182 24-Dec-2007 ad

b_un.b_addr -> b_data


Revision tags: yamt-kmem-base3 cube-autoconf-base yamt-kmem-base2 yamt-kmem-base vmlocking2-base2 reinoud-bufcleanup-nbase vmlocking2-base1 vmlocking-nbase jmcneill-pm-base reinoud-bufcleanup-base
# 1.181 02-Dec-2007 hannken

branches: 1.181.2; 1.181.6;
Fscow_run(): add a flag "bool data_valid" to note still valid data.
Buffers run through copy-on-write are marked B_COWDONE. This condition
is valid until the buffer has run through bwrite() and gets cleared from
biodone().

Welcome to 4.99.39.

Reviewed by: YAMAMOTO Takashi <yamt@netbsd.org>


Revision tags: jmcneill-base bouyer-xenamd64-base2 bouyer-xenamd64-base
# 1.180 21-Oct-2007 martin

branches: 1.180.2;
Fix vm.bufmem* sysctl nodes for 64bit archs by making them uint64_t,
as discussed on tech-kern. No requests for binary compat - so don't
bother to version them.


Revision tags: yamt-x86pmap-base4 yamt-x86pmap-base3 vmlocking-base
# 1.179 08-Oct-2007 ad

branches: 1.179.2;
Merge brelse() changes from the vmlocking branch.


Revision tags: yamt-x86pmap-base2 yamt-x86pmap-base
# 1.178 16-Sep-2007 dsl

branches: 1.178.2;
Put the RCSID before any other headers


Revision tags: nick-csl-alignment-base5
# 1.177 01-Sep-2007 pooka

Make bioops a pointer and point it to the softdeps struct in softdep
init. Decouples "options SOFTDEP" from the main kernel and ffs code.


# 1.176 11-Aug-2007 pooka

branches: 1.176.2;
POOL_INIT -> pool_init, we need to call bufinit() anyway


Revision tags: matt-mips64-base
# 1.175 29-Jul-2007 ad

branches: 1.175.4; 1.175.6;
B_ERROR is gone.


# 1.174 29-Jul-2007 ad

It's not a good idea for device drivers to modify b_flags, as they don't
need to understand the locking around that field. Instead of setting
B_ERROR, set b_error instead. b_error is 'owned' by whoever completes
the I/O request.


Revision tags: nick-csl-alignment-base mjf-ufs-trans-base
# 1.173 09-Jul-2007 ad

branches: 1.173.2;
Merge some of the less invasive changes from the vmlocking branch:

- kthread, callout, devsw API changes
- select()/poll() improvements
- miscellaneous MT safety improvements


# 1.172 17-May-2007 yamt

merge yamt-idlelwp branch. asked by core@. some ports still need work.

from doc/BRANCHES:

idle lwp, and some changes depending on it.

1. separate context switching and thread scheduling.
(cf. gmcgarry_ctxsw)
2. implement idle lwp.
3. clean up related MD/MI interfaces.
4. make scheduler(s) modular.


Revision tags: yamt-idlelwp-base8 thorpej-atomic-base
# 1.171 12-Mar-2007 ad

branches: 1.171.2; 1.171.4;
Pass an ipl argument to pool_init/POOL_INIT to be used when initializing
the pool's lock.


# 1.170 04-Mar-2007 christos

branches: 1.170.2;
Kill caddr_t; there will be some MI fallout, but it will be fixed shortly.


Revision tags: ad-audiomp-base
# 1.169 22-Feb-2007 thorpej

TRUE -> true, FALSE -> false


Revision tags: post-newlock2-merge
# 1.168 09-Feb-2007 ad

branches: 1.168.2;
Merge newlock2 to head.


Revision tags: netbsd-4-0-RC3 netbsd-4-0-RC2 netbsd-4-0-RC1 newlock2-nbase yamt-splraiseipl-base5 yamt-splraiseipl-base4 yamt-splraiseipl-base3 newlock2-base netbsd-4-base
# 1.167 14-Nov-2006 reinoud

branches: 1.167.2; 1.167.4;
Export nestiobuf_iodone(). This allows nested iobufs to have a custom
call-back function that can then call the nestiobuf_iodone() to propagate.


# 1.166 01-Nov-2006 yamt

remove some __unused from function parameters.


Revision tags: yamt-splraiseipl-base2
# 1.165 16-Oct-2006 christos

with the introduction of 512 byte buffers, the index in the array is not
the number of kilobytes anymore, so name the pools appropriately.


# 1.164 12-Oct-2006 christos

- sprinkle __unused on function decls.
- fix a couple of unused bugs
- no more -Wno-unused for i386


Revision tags: yamt-splraiseipl-base yamt-pdpolicy-base9
# 1.163 10-Sep-2006 yamt

branches: 1.163.2;
unexport getnewbuf.


Revision tags: rpaulo-netinet-merge-pcb-base
# 1.162 03-Sep-2006 christos

branches: 1.162.2;
use c99 initializers


Revision tags: abandoned-netbsd-4-base yamt-pdpolicy-base8 yamt-pdpolicy-base7 yamt-pdpolicy-base6 chap-midi-nbase gdamore-uart-base chap-midi-base simonb-timecounters-base
# 1.161 25-May-2006 yamt

move wait points for kva from upper layers to vm_map. PR/33185 #1.

XXX there is a concern about interaction with kva fragmentation.
see: http://mail-index.NetBSD.org/tech-kern/2006/05/11/0000.html


Revision tags: yamt-pdpolicy-base5
# 1.160 14-May-2006 elad

branches: 1.160.2;
integrate kauth.


Revision tags: yamt-pdpolicy-base4 elad-kernelauth-base
# 1.159 05-Apr-2006 uwe

Tell config to generate fs_ffs.h as vfs_bio.c checks for defined(FFS).
Include that header in vfs_bio.c so that bioops are not redefined.


Revision tags: yamt-pdpolicy-base3
# 1.158 17-Mar-2006 tls

Add one more buffer pool, for 512-byte buffers. On the one hand, most
systems will never, ever need this -- because they use 8k/1k or even,
these days, 16k/2k or 32k/4k filesystems. On the other hand, when you
do need this, you *really* need it: on anoncvs.netbsd.org, for instance,
where /tmp is 4k/512 and the filesystem contains tens or even hundreds
of thousands of single-frag directories, this essentially doubles the
efficiency of the allocator. Since the overhead of keeping one extra
pool around is minimal, just add it by default.


Revision tags: peter-altq-base yamt-pdpolicy-base2
# 1.157 05-Mar-2006 christos

branches: 1.157.2; 1.157.4;
Move ISSET/SET/CLR macros to sys/types.h


Revision tags: yamt-pdpolicy-base yamt-uio_vmspace-base5
# 1.156 04-Feb-2006 yamt

branches: 1.156.2;
nestiobuf_iodone: remove a comment which is no longer true.


# 1.155 21-Jan-2006 reinoud

branches: 1.155.2; 1.155.4;
Propagate an appropriate error code in nestiobuf_iodone() to the master
buffer when the passed nested buffer has no B_ERROR flag set but not all
was transferred for the nested iobuf extent.

Discussed on tech-kern and ok'd by Takashi


# 1.154 15-Jan-2006 yamt

allocbuf: yield cpu in a loop.


# 1.153 15-Jan-2006 yamt

- use POOL_INIT for bufpool.
- make bufiopool static.


# 1.152 11-Jan-2006 yamt

add nestiobuf api for convenience when splitting a request to several pieces.


# 1.151 07-Jan-2006 yamt

remove B_EINTR as it isn't used anymore.


# 1.150 05-Jan-2006 yamt

use a dedicated buf pool for getiobuf.


# 1.149 04-Jan-2006 yamt

- add simple functions to allocate/free a buffer for i/o.
- make bufpool static.


# 1.148 24-Dec-2005 perry

branches: 1.148.2;
Remove leading __ from __(const|inline|signed|volatile) -- it is obsolete.


# 1.147 11-Dec-2005 christos

merge ktrace-lwp.


Revision tags: yamt-readahead-base3 yamt-readahead-base2 yamt-readahead-pervnode yamt-readahead-perfile yamt-readahead-base yamt-vop-base3 yamt-vop-base2 thorpej-vnode-attr-base yamt-vop-base ktrace-lwp-base
# 1.146 09-Jun-2005 atatat

branches: 1.146.2;
Properly fix the constipated lossage wrt -Wcast-qual and the sysctl
code. I know it's not the prettiest code, but it seems to work rather
well in spite of itself.


# 1.145 29-May-2005 christos

- add const.
- remove unnecessary casts.
- add __UNCONST casts and mark them with XXXUNCONST as necessary.


Revision tags: kent-audio2-base
# 1.144 01-Apr-2005 yamt

merge yamt-km branch.
- don't use managed mappings/backing objects for wired memory allocations.
save some resources like pv_entry. also fix (most of) PR/27030.
- simplify kernel memory management API.
- simplify pmap bootstrap of some ports.
- some related cleanups.


# 1.143 31-Mar-2005 chs

fix validation of new values when setting vm.{hi,low}water. fixes PR 29651.


Revision tags: yamt-km-base4 yamt-km-base3 netbsd-3-base
# 1.142 26-Feb-2005 perry

branches: 1.142.2;
nuke trailing whitespace


Revision tags: yamt-km-base2 yamt-km-base kent-audio1-beforemerge
# 1.141 10-Jan-2005 tls

branches: 1.141.2; 1.141.4;
Users have observed that the amount of memory used by the metadata cache
can in some situations exceed the high-water mark, and stay there once it
gets there. Adjust the canrelease function so that it will immediately
bring us back down to the high-water mark in this situation.

How can this happen at all? Consider a machine with two filesystems, one
with a much larger blocksize than the other. If the small-block filesystem
is very busy, growing the cache up to the high-water mark, and then the
large-block filesystem becomes busy, buffers will be recycled (since we
are at the high-water mark) but _grow each time they're recycled_. Once
we're above the high-water mark, the canrelease call in allocbuf (without
this change) doesn't shrink us back down below it; so things get worse and
worse.
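
The adjustment described above can be sketched as follows. This is a minimal illustration, not the code from this revision: the function name release_target() and its parameters are hypothetical, and the in-range policy at the end is an assumption. The point is only the overshoot case, where everything above the high-water mark is handed back at once.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: how many bytes the cache should give back.
 * All names here are illustrative, not taken from vfs_bio.c.
 */
static size_t
release_target(size_t bufmem, size_t lowater, size_t hiwater, size_t demand)
{
	/* Below the low-water mark: release nothing. */
	if (bufmem < lowater)
		return 0;

	/* Above the high-water mark: release the whole overshoot. */
	if (bufmem > hiwater)
		return bufmem - hiwater;

	/* In between (assumed policy): honour demand, never go below lowater. */
	size_t spare = bufmem - lowater;
	return demand < spare ? demand : spare;
}
```

With the overshoot branch in place, a cache that grew past the high-water mark through recycled-and-enlarged buffers is pulled back to the mark on the next call, instead of staying above it.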


# 1.140 23-Dec-2004 dbj

also define bioops if FFS is not defined.


Revision tags: kent-audio1-base
# 1.139 05-Dec-2004 jrf

Fix previous commit, got bufcache and bufmem messages reversed.


# 1.138 05-Dec-2004 jrf

Change sysctl -d vm.bufcache to say percent of physical memory not
kernel memory. Addresses PR misc/27233. Approved by atatat@netbsd.org.


# 1.137 13-Nov-2004 christos

PR/25749: Peter Postma: Missing splx() in kernel.


# 1.136 04-Oct-2004 enami

- Testing low memory condition to see if we should alloc or not doesn't make
sense, since 1) the condition is quite normal condition and 2) there is
pool between us and uvm.
- Make the step of allocation possibility a bit seamless by moving the origin
of curve from 0 to lowater mark.
Simon reported that this helps interactive performance when there is heavy
disk activity in PR#27057.


# 1.135 04-Oct-2004 enami

Factor out code to set watermark and ensure high > low.


# 1.134 03-Oct-2004 enami

- Don't let pagedaemon sleep while draining buf.
- Estimate amount of memory to free at a time.
Address PR#27057 (and similar hangs I saw several months ago).


# 1.133 03-Oct-2004 enami

x > 15 is always false if x is 0 .. 15.
# XXX: testing free memory here is quite doubtful. also, I guess lowater
# XXX: is better than 0 as origin.


# 1.132 03-Oct-2004 enami

Cheap test first.


# 1.131 18-Sep-2004 yamt

fix allocbuf() O(n**2) behaviour where n is number of AGE buffers
by always tracking amount of buffers on a queue.
bump to 2.0H.


# 1.130 18-Sep-2004 yamt

- add missing function prototypes.
- fix prototype mismatches.


# 1.129 08-Sep-2004 yamt

buf_trim: a buffer grabbed by getnewbuf() should be clean and anonymous.
thus, there's no need to check and handle B_WANTED here.


# 1.128 20-Jun-2004 hannken

- Add flag L_COWINPROGRESS to struct lwp to avoid recursion when
doing copy-on-write.

- Change VFS_SNAPSHOT() to return the snapshot vnode locked.

- Make the IO path for copy-on-write and snapshot-read more lightweight.
Avoids deadlocks where vn_rdwr(...READ...) has a shared lock and needs
to copy-on-write.
Avoids deadlocks/panics where to clean pages the copy-on-write needs
to allocate pages for its VOP_PUTPAGES().

L_COWINPROGRESS part approved by: Jason R. Thorpe <thorpej@netbsd.org>


# 1.127 20-Jun-2004 thorpej

When initializing the buffer cache memory pools where the size <= PAGE_SIZE,
also use the standard allocator on systems that use a direct-mapped memory
segment for mapping pool pages.


# 1.126 20-Jun-2004 thorpej

Don't use PR_IMMEDRELEASE on buffer cache pools. Instead, set a high
water mark of 1, which will have the same effect.

Pointed out back in January by YAMAMOTO Takashi.


# 1.125 25-May-2004 atatat

Remaining sysctl descriptions under kern subtree


# 1.124 25-Apr-2004 yamt

bio_doread: vp is always non-NULL here.


# 1.123 21-Apr-2004 christos

Replace the statfs() family of system calls with statvfs().
Retain binary compatibility.


Revision tags: netbsd-2-0-base
# 1.122 26-Mar-2004 simonb

branches: 1.122.2;
Give buf_lotsfree() a bit of a service:
- Fix a 32-bit overflow that could erroneously return true even if the
currently allocated buffer memory was greater than the high water mark.
- Add an early check for bufmem > hiwater to avoid a needless call to
random().
- Sprinkle some comments.
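
The log does not show the expression that overflowed, so the following is only an illustration of this class of bug, assuming a 16-step fill level computed from the current buffer memory and the high-water mark. Both function names are hypothetical. Multiplying first wraps in 32 bits once bufmem reaches 256 MB; dividing the high-water mark first does not.

```c
#include <stdint.h>

/* Hypothetical: fill level in sixteenths, multiplying first (can wrap). */
static uint32_t
fill_step_naive(uint32_t bufmem, uint32_t hiwater)
{
	/* 16 * bufmem is reduced modulo 2^32 before the division. */
	return (uint32_t)(16u * bufmem) / hiwater;
}

/* Hypothetical: same quantity, dividing first so nothing wraps. */
static uint32_t
fill_step_safe(uint32_t bufmem, uint32_t hiwater)
{
	uint32_t step = hiwater / 16;

	/* Guard against a tiny high-water mark (assumed behaviour). */
	if (step == 0)
		return 16;
	return bufmem / step;
}
```

For example, with 300 MB in use against a 320 MB high-water mark, the naive form reports a nearly empty cache while the safe form reports a nearly full one, which is how an "allocate more" answer could be returned in error.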

Add a vm.bufmem sysctl so the current bufmem value can be easily queried
from userland.

Reviewed by Thor Simon.


# 1.121 25-Mar-2004 simonb

More white space nits.


# 1.120 25-Mar-2004 simonb

White-space nit.


# 1.119 24-Mar-2004 atatat

Tango on sysctl_createv() and flags. The flags have all been renamed,
and sysctl_createv() now uses more arguments.


# 1.118 22-Feb-2004 dan

micro-optimisation - if we're going to return 0, do so before doing
other unnecessary work


# 1.117 19-Feb-2004 atatat

Add PTRTOUINT64() and UINT64TOPTR() macros to sys/sysctl.h for use by
kern.proc, kern.proc2, kern.lwp, and kern.buf.

Define more MIB for kern.buf so that specific buffers can be selected
(only all/all is supported right now), and use a 32/64 bit agnostic
structure for communicating buffer information to userland.

Convert systat to the new kern.buf method.

Clean up the vm.buf* handling a little. There's no actual need to
record the dynamically assigned OIDs, since sysctl_data can tell us
what we're looking at.

Oh, and fix a typo in a comment.


# 1.116 16-Feb-2004 yamt

- raise ipl when calling buf_canrelease() because it traverses buffer queue.
- correct/add comments on buf_canrelease().


# 1.115 11-Feb-2004 tls

Fix bug noted by yamt@netbsd.org: the UVM free target is in *pages*,
so the last change has us comparing pages to bytes instead of pages
to buffers! The consequence was to try to free radically less memory
than UVM wanted us to -- though always at least one buffer, which is
probably why the results weren't dire.

This does suggest that buf_canrelease() could be a *lot* more
conservative about how much to release than "2 * page deficit". In
fact, serious trouble seems to ensue if it's not -- when anything
else on the system demands enough pages, we slam down to the low
water mark and stay there. I've adjusted it to use min(page deficit,
buffer memory / 16), which still isn't quite right but seems better.
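
A minimal sketch of the min(page deficit, buffer memory / 16) estimate, assuming both quantities are first brought to a common unit (bytes here). The function name, the parameter names and the 4096-byte page size are assumptions made for the illustration; the real code may use a different unit.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/* Hypothetical: bytes to release for a given page deficit. */
static size_t
drain_estimate(size_t page_deficit, size_t bufmem_bytes)
{
	/* Convert the UVM deficit from pages to bytes before comparing. */
	size_t wanted = page_deficit * SKETCH_PAGE_SIZE;

	/* Never give back more than 1/16 of the cache in one go. */
	size_t cap = bufmem_bytes / 16;

	return wanted < cap ? wanted : cap;
}
```

The cap is what keeps a large demand from elsewhere in the system from driving the cache straight down to the low-water mark.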

Another change: consider the case of an infinite loop that does
"tar xzf pkgsrc.tar.gz ; rm -rf pkgsrc". Each time the rm runs,
all the dead metadata will go on the AGE list -- and, until we hit
the high-water mark, stay there, at which point it may be slowly
recycled. Two adjustments seem to solve this: 1) whack buf_lotsfree()
to return 0 if there's anything on the AGE list; 2) whack buf_canrelease()
to count the memory used by the AGE list and always return at least
that much.

This basically turns the AGE list into a "delayed free" list, since we
can't entirely eliminate it as we can't free pool items from interrupt
context (e.g. from biodone()).

To consider: with the bookkeeping corrected, should buf_drain() move
back to the _end_ of the pagedaemon, and should the calculation then
try to give back at least the current deficit?


# 1.114 30-Jan-2004 tls

Buffer cache fixes to avoid thrashing between high and low water marks
and uncontrolled growth.

The key fix is from Dan Carasone, who noticed that buf_canfree() was
counting in _bytes_ but freeing in _buffers_, which caused the instant
drop to lowater observed by some users.

We now control the rate of growth; the probability of getting a new
allocation is inversely proportional to the current size of the
cache. This idea is from a long-ago conversation with Kirk McKusick
and, if memory serves, was used for the file-system cache in some
other BSD variant at some point in history.
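
One way to read the growth control above is as a random admission test whose pass rate falls as the cache fills. The sketch below is an assumption-laden illustration, not the committed code: may_grow() is a hypothetical name, the 16-step granularity is assumed, and the random draw is passed in as a parameter so the behaviour is deterministic.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical: decide whether the cache may take a new allocation.
 * `draw` stands in for a random number; only its low 4 bits are used.
 */
static bool
may_grow(size_t bufmem, size_t hiwater, unsigned draw)
{
	size_t step = hiwater / 16;

	/* Degenerate limit: fall back to a plain comparison. */
	if (step == 0)
		return bufmem < hiwater;

	/* Fill level in sixteenths of the high-water mark. */
	size_t thresh = bufmem / step;

	/* The fuller the cache, the fewer draws pass. */
	return (size_t)(draw & 0xf) >= thresh;
}
```

With an empty cache every draw passes; at half of the high-water mark half of the 16 possible draws pass; at the high-water mark none do.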

With growth and shrinkage more or less dealt with, we return the
default maximum cache size to 15%. The default _minimum_ cache size
is raised from 1/16 of the maximum cache size to 1/8, since 1/16 was
chosen when the maximum size was 30% of memory.

Finally, after observing the behaviour of the pagedaemon and the
buffer cache drainer under pathological workloads (e.g. a benchmark
that steps through 75% of available memory backwards) I have moved
the call to buf_drain() to the beginning of the pagedaemon from the
end; if the pagedaemon bogs down, it still won't get run as often
as it should, but at least this way it will see the state of the
free count and free target _before_ the scan step does its thing.


# 1.113 27-Jan-2004 dan

Reduce the default BUFCACHE to 10% for now. Too many users are
tripping over this getting too large, and suffering other performance
problems due to the lack of good backpressure shrinking the bufcache
when other memory is required. Again, this tunable should be
revisited when the backpressure mechanism has been improved.

sysctl vm.bufcache can be used to manually tune those rare machines
that might need more than this.

See comments in rev 1.106 for more detail.


# 1.112 25-Jan-2004 hannken

Make VOP_STRATEGY(bp) a real VOP as discussed on tech-kern.

VOP_STRATEGY(bp) is replaced by one of two new functions:

- VOP_STRATEGY(vp, bp) Call the strategy routine of vp for bp.
- DEV_STRATEGY(bp) Call the d_strategy routine of bp->b_dev for bp.

DEV_STRATEGY(bp) is used only for block-to-block device situations.


# 1.111 19-Jan-2004 yamt

bufpool_page_alloc: for no-wait allocations, specify UVM_KMF_TRYLOCK as well.


# 1.110 15-Jan-2004 enami

Obviously, sizeof(u_int) is not enough to copy struct buf.
Prevents ``sysctl -a'' from dumping core.


# 1.109 10-Jan-2004 yamt

reset i/o priority in geteblk() as well.


# 1.108 10-Jan-2004 yamt

store an i/o priority hint in struct buf for buffer queue discipline.


# 1.107 09-Jan-2004 thorpej

Initialize buffer pools with PR_IMMEDRELEASE. Don't use pool_reclaim()
on those pools; it is no longer necessary.


# 1.106 09-Jan-2004 tls

Change BUFCACHE (default hard limit on physmem consumption by metadata
cache) from 30% to 20%. This seems to significantly smooth the oscillation
between "almost no memory available" and "UVM free target available" caused
by the current sudden, heavy backpressure on the metadata cache. We should
revisit this again once the backpressure mechanism is better tuned; ideally,
the hard limit should almost never come into play, because the metadata
cache should gradually give back pages as buffers hit the AGE list and as
the page cache demands them, rather than giving back a big slug of pages
all at once when UVM decides it's in a hurry and fires off the page daemon.

Just how well this adjustment works is likely to vary significantly from
machine to machine depending on I/O mix, filesystem frag size, and total
memory. However, 20% seems to be quite a bit better than 30% on several
systems I've tested and is, coincidentally, more than enough to cache
the entire metadata working set of the AnonCVS server with 100 clients,
which is a useful worst-case stake in the ground...


# 1.105 08-Jan-2004 tls

Add pool_reclaim() on pool to which we just pool_put() a buffer in
buf_mrelease(). Without this, though the pages are returned to the
relevant *pool*, they are never available for any other use in the
system.

Now the backpressure on the physical size of the buffer cache through
the buf_drain() call in the pagedaemon works correctly. If anything,
it may be a bit more aggressive than intended. On my 256MB system,
with vm.bufcache set to the default 30% of physmem, a kernel with this
fix can do 5 simultaneous config/makedep/builds of different NetBSD
kernels in 1313 seconds; with the "traditional" buffer cache code it
requires 1320 seconds. Running "find / -type d -exec ls -l {}" while
the build is going demonstrates that the backpressure is working
correctly: free memory oscillates slowly between close to none and
the UVM target free, and vmstat -m shows a large number of releases
for the buffer pools.

For future work: how is "bufpl" memory returned to the system? This
is not obvious to me (I must be looking in the wrong place). Also,
buf_mrelease() is also called from brelse() in some cases. Would it
be better to add a pool flag causing automatic release of full pages
as they become available (not fragmented)? Jason Thorpe proposed this
and it seems more elegant than cleaning the _entire_ pool only upon
memory pressure.

Greg Oster did a lot of the work of figuring this out. Jason proposed
the use of pool_reclaim as a way to fix it.


# 1.104 06-Jan-2004 atatat

Expose the buf_map symbol so that pmap(1) can find it.

Split the sysctl setup routine into two routines, one for each
"subtree". Perhaps it's a little pedantic, but it's cleaner. Also,
assert that the "kern" and "vm" nodes exist.


# 1.103 04-Jan-2004 pk

bufpool_page_free: pass `buf_map' to uvm_km_free().


# 1.102 31-Dec-2003 pk

getnewbuf: return buffer locked.


# 1.101 30-Dec-2003 thorpej

Consistently use ANSI-style function decls.


# 1.100 30-Dec-2003 pk

Replace the traditional buffer memory management -- based on fixed per buffer
virtual memory reservation and a private pool of memory pages -- by a scheme
based on memory pools.

This allows better utilization of memory because buffers can now be allocated
with a granularity finer than the system's native page size (useful for
filesystems with e.g. 1k or 2k fragment sizes). It also avoids fragmentation
of virtual to physical memory mappings (due to the former fixed virtual
address reservation) resulting in better utilization of MMU resources on some
platforms. Finally, the scheme is more flexible by allowing run-time decisions
on the amount of memory to be used for buffers.

On the other hand, the effectiveness of the LRU queue for buffer recycling
may be somewhat reduced compared to the traditional method since, due to the
nature of the pool based memory allocation, the actual least recently used
buffer may release its memory to a pool different from the one needed by a
newly allocated buffer. However, this effect will kick in only if the
system is under memory pressure.


# 1.99 02-Dec-2003 dbj

when ifdef DEBUG and debug_verify_freelist != 0
then perform an expensive search of the buffer freelists
in brelse and bremfree to verify consistency


# 1.98 02-Dec-2003 dbj

add explanatory comment in bremfree:
We break the TAILQ abstraction in order to efficiently remove a
buffer from its freelist without having to know exactly which
freelist it is on.


# 1.97 08-Nov-2003 dbj

protect a few uses of buf's b_flags with b_interlock


# 1.96 24-Sep-2003 yamt

in getblk(), don't call allocbuf() for B_LOCKED buffers.

LFS loses track of the total size of B_LOCKED buffers (locked_queue_bytes)
when getblk() resizes them.

XXX maybe needs a better fix.


# 1.95 07-Sep-2003 yamt

buffer with B_CALL shouldn't be brelse'ed. assert it.


# 1.94 07-Sep-2003 yamt

bremfree needs bqueue_slock held. assert it.


# 1.93 07-Aug-2003 agc

Move UCB-licensed code from 4-clause to 3-clause licence.

Patches provided by Joel Baker in PR 22364, verified by myself.


# 1.92 09-Apr-2003 yamt

branches: 1.92.2;
remove B_NEEDCOMMIT as it's no longer used.


# 1.91 25-Feb-2003 thorpej

Add a new BUF_INIT() macro which initializes b_dep and b_interlock, and
use it. This fixes a few places where either b_dep or b_interlock were
not properly initialized.


# 1.90 06-Feb-2003 pk

bdwrite(): remove check for MFS major device number (why was 255 changed
to 4096?). In any case, bdevsw_lookup() will take care of it.


# 1.89 06-Feb-2003 pk

In getnewbuf(), release the buffer queue lock before calling bawrite() and
re-acquire it afterward.


# 1.88 06-Feb-2003 pk

Require the bdirty() be called at splbio() and with the buffer interlock held.
This is essentially just a helper routine called from biodone() through
ffs softdep's I/O completion, to re-queue the buffer.


# 1.87 05-Feb-2003 pk

Make the buffer cache code MP-safe.


# 1.86 18-Jan-2003 thorpej

Merge the nathanw_sa branch.


Revision tags: nathanw_sa_before_merge fvdl_fs64_base gmcgarry_ctxsw_base gmcgarry_ucred_base nathanw_sa_base kqueue-aftermerge kqueue-beforemerge kqueue-base
# 1.85 06-Sep-2002 gehenna

Merge the gehenna-devsw branch into the trunk.

This merge changes the device switch tables from static array to
dynamically generated by config(8).

- All device switches are defined as constant structures in device drivers.

- The new grammar ``device-major'' is introduced to ``files''.

device-major <prefix> char <num> [block <num>] [<rules>]

- All device major numbers must be listed in the port-dependent majors.<arch>
by using this grammar.

- Added the new naming convention.
The name of the device switch must be <prefix>_[bc]devsw for auto-generation
of device switch tables.

- The backward compatibility of loading block/character device
switch by LKM framework is broken. This is necessary to convert
from block/character device major to device name in runtime and vice versa.

- The restriction to assign device major by LKM is completely removed.
We don't need to reserve LKM entries for dynamic loading of device switch.

- In compile time, device major numbers list is packed into the kernel and
the LKM framework will refer it to assign device major number dynamically.


# 1.84 04-Sep-2002 matt

Use the queue macros from <sys/queue.h> instead of referring to the queue
members directly. Use *_FOREACH whenever possible.


Revision tags: gehenna-devsw-base
# 1.83 30-Aug-2002 hannken

Remove the old device buffer queue interface.

Approved by: Jason R. Thorpe <thorpej@wasabisystems.com>


# 1.82 25-Aug-2002 thorpej

Make nbuf, nswbuf, and bufpages unsigned. Make all operations on these
variables unsigned, and update places where their values are printed.


Revision tags: netbsd-1-6-PATCH002-RELEASE netbsd-1-6-PATCH002 netbsd-1-6-PATCH002-RC4 netbsd-1-6-PATCH002-RC3 netbsd-1-6-PATCH002-RC2 netbsd-1-6-PATCH002-RC1 netbsd-1-6-PATCH001 netbsd-1-6-PATCH001-RELEASE netbsd-1-6-PATCH001-RC3 netbsd-1-6-PATCH001-RC2 netbsd-1-6-PATCH001-RC1 netbsd-1-6-RELEASE netbsd-1-6-RC3 netbsd-1-6-RC2 netbsd-1-6-RC1 netbsd-1-6-base
# 1.81 12-May-2002 matt

branches: 1.81.2;
Eliminate commons.


Revision tags: eeh-devprop-base
# 1.80 16-Mar-2002 chs

fix bread() to return errors from reading past the end of the device.
back in rev. 1.51, bread() and breadn() were changed to assume that
if B_DONE is set on a buffer returned by bio_doread(), that the buffer
must have already been in the cache, and thus the overall bread() should
return success. but if the requested buffer is not in the cache and
is past the end of the device, bounds_check_with_label() will set B_ERROR
on the buffer and the caller will call biodone(), which will cause bread()
to think the buffer was already in the cache and thus return success.
to fix this, undo rev. 1.51 and instead have biowait() treat both B_DONE
and B_DELWRI as indicators that it doesn't need to sleep waiting for an
i/o to complete.


Revision tags: newlock-base
# 1.79 08-Mar-2002 thorpej

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:

* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot allocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.

From art@openbsd.org.


Revision tags: ifpoll-base
# 1.78 10-Feb-2002 chs

getblk()'s "block size invariant" isn't valid for VBLK vnodes
since bounds_check_with_label() will truncate a buffer that crosses
the end of the partition. adjust the assertion to account for this.
fixes PRs 7938, 12156, 12698, 13076, 13210 and 13288.


Revision tags: thorpej-mips-cache-base
# 1.77 12-Nov-2001 lukem

add RCSIDs


Revision tags: thorpej-devvp-base3 thorpej-devvp-base2 post-chs-ubcperf pre-chs-ubcperf thorpej-devvp-base thorpej_scsipi_beforemerge thorpej_scsipi_nbase thorpej_scsipi_base
# 1.76 01-Apr-2001 chs

branches: 1.76.2; 1.76.4; 1.76.6;
in bwrite(), when deciding whether to convert sync writes into delayed writes,
examine the filesystem contained in a block device rather than the filesystem
containing the block device vnode. fixes PR 12484.


# 1.75 10-Mar-2001 chs

in getnewbuf(), when we need to write a buffer before reusing it,
return NULL instead of restarting the loop since we might sleep
while starting the i/o. this tells getblk() to check if someone else
created the buffer while we slept. from OpenBSD.


# 1.74 13-Dec-2000 jdolecek

branches: 1.74.2;
this doesn't need <sys/trace.h>


# 1.73 27-Nov-2000 chs

Initial integration of the Unified Buffer Cache project.


# 1.72 18-Nov-2000 simonb

Don't use alloca() - breaks compile on alpha (alloca is not prototyped
anywhere).


# 1.71 14-Nov-2000 thorpej

NBPG -> PAGE_SIZE


# 1.70 08-Nov-2000 ad

Update for hashinit() change.


# 1.69 08-Nov-2000 chs

use round_page(...) instead of roundup(..., NBPG).


# 1.68 27-Jun-2000 mrg

remove include of <vm/vm.h>


Revision tags: netbsd-1-5-RELEASE netbsd-1-5-BETA2 netbsd-1-5-BETA netbsd-1-5-ALPHA2 netbsd-1-5-base minoura-xpg4dl-base
# 1.67 12-Apr-2000 fvdl

branches: 1.67.4;
Fix from Ethan Solomita <ethan@geocast.com> to avoid a livelock problem
where the buffer cache code would be recycling B_AGE buffers with
dependencies.


# 1.66 30-Mar-2000 augustss

Get rid of register declarations.


# 1.65 14-Feb-2000 thorpej

One small piece from UBC: create a pool for I/O buffers. One small piece
not from UBC: make physio use it instead of its own home-grown thing.


Revision tags: chs-ubc2-newbase
# 1.64 07-Feb-2000 thorpej

Fix a bug in disksort_*() which caused non-optimal ordering when multiple
active partitions were on a single spindle. Add a b_rawblkno member to
struct buf which contains the non-partition-relative block number to sort
by.


# 1.63 21-Jan-2000 thorpej

- Implement bowrite() -- perform an asynchronous, ordered write.


Revision tags: wrstuden-devbsize-19991221 wrstuden-devbsize-base
# 1.62 03-Dec-1999 ragge

First round of discarding the CL* macros.


# 1.61 26-Nov-1999 fvdl

Clear B_AGE in bdirty(), this buffer must go through the LRU again
to be back on the AGE queue. Otherwise we risk recycling a set
of buffers with (soft) dependencies on the AGE list, which may
last forever if the vnode they belong to is locked (i.e. the syncer
won't get to the buffers they depend on, so their dependencies
are never flushed).


# 1.60 23-Nov-1999 fvdl

Be more careful to block bio interrupts for some data structures. There
were at least a few missed cases where vp->v_{clean,dirty}blkhd were
unprotected since the softdep/trickle sync merge.


# 1.59 15-Nov-1999 fvdl

Add Kirk McKusick's soft updates code to the trunk. Not enabled by
default, as the copyright on the main file (ffs_softdep.c) is such
that it has been put into gnusrc. options SOFTDEP will pull this
in. This code also contains the trickle syncer.

Bump version number to 1.4O


Revision tags: netbsd-1-4-PATCH003 netbsd-1-4-PATCH002 kame_141_19991130 comdex-fall-1999-base fvdl-softdep-base netbsd-1-4-PATCH001 kame_14_19990705 kame_14_19990628 chs-ubc2-base netbsd-1-4-RELEASE netbsd-1-4-base kenh-if-detach-base
# 1.58 09-Nov-1998 mycroft

branches: 1.58.6; 1.58.10; 1.58.12; 1.58.16;
GC the B_CACHE bit.


Revision tags: chs-ubc-base
# 1.57 27-Oct-1998 mycroft

branches: 1.57.2;
Several things:
* Change the usage of B_DONE so that it is only set when a buffer is in sync
with the data on disk.
* If a buffer is being waited for, don't put it on the age queue.
* Make sure to clear B_DONE when pages are stolen from a buffer.
* Make sure to clear B_CACHE after each use.
* If we find a buffer for the block we want with valid data, but it is too
small, panic. (This isn't supposed to happen.)
Fixes potential file corruption problems with clustering.


# 1.56 13-Aug-1998 eeh

Merge paddr_t changes into the main branch.


# 1.55 04-Aug-1998 perry

Abolition of bcopy, ovbcopy, bcmp, and bzero, phase one.
bcopy(x, y, z) -> memcpy(y, x, z)
ovbcopy(x, y, z) -> memmove(y, x, z)
bcmp(x, y, z) -> memcmp(x, y, z)
bzero(x, y) -> memset(x, 0, y)
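
The mapping above can be sketched as thin wrappers, which makes the swapped source/destination order explicit. The legacy_* names are hypothetical stand-ins written for this illustration; they are not the historical kernel functions.

```c
#include <string.h>

/* bcopy(src, dst, len): source first; memcpy takes the destination first. */
static void
legacy_bcopy(const void *src, void *dst, size_t len)
{
	memcpy(dst, src, len);
}

/* ovbcopy(src, dst, len): overlap-safe copy, so it maps to memmove. */
static void
legacy_ovbcopy(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* bcmp(a, b, len): zero if equal, non-zero otherwise. */
static int
legacy_bcmp(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len) != 0;
}

/* bzero(p, len): fill with zero bytes. */
static void
legacy_bzero(void *p, size_t len)
{
	memset(p, 0, len);
}
```

Only the two copy routines change argument order; the compare and zero routines keep theirs, with memset gaining the explicit fill value.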


# 1.54 31-Jul-1998 perry

fix sizeofs so they comply with the KNF style guide. yes, it is pedantic.


Revision tags: eeh-paddr_t-base
# 1.53 07-Feb-1998 chs

branches: 1.53.2;
add flags arg to hashinit(), to pass to malloc().


Revision tags: netbsd-1-3-PATCH002 netbsd-1-3-PATCH001 netbsd-1-3-RELEASE netbsd-1-3-BETA netbsd-1-3-base thorpej-signal-base marc-pcmcia-bp marc-pcmcia-base
# 1.52 08-Jul-1997 pk

branches: 1.52.6;
Check `b_dev' field in bdwrite() before using it as an index into bdevsw[].
`b_dev' value of NODEV happens and is normal if the buffer is on its way
to the underlying device strategy function for the first time.

Also, MFS sillily uses a major device number (255) which cannot be used
to index bdevsw[]. Check marked with XXXs.
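
The guard described above can be sketched roughly as follows. This is not
the code from bdwrite(); the table, the sk_* names, and the use of a plain
integer major number with -1 standing in for NODEV are all assumptions made
for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the block device switch table. */
enum sk_dev_type { SK_DISK, SK_TAPE };

struct sk_bdevsw {
	enum sk_dev_type d_type;
};

static const struct sk_bdevsw sk_bdevsw[] = {
	{ SK_DISK },	/* major 0 */
	{ SK_TAPE },	/* major 1 */
	{ SK_DISK },	/* major 2 */
};

#define SK_NBLKDEV	(sizeof(sk_bdevsw) / sizeof(sk_bdevsw[0]))
#define SK_NOMAJOR	(-1)	/* assumed sentinel for "no device yet" */

/*
 * Return true only when the major number is a valid index into the
 * table and the entry is a tape.  Out-of-range values (the sentinel,
 * or an oversized major such as 255) are rejected before indexing.
 */
static bool
sk_is_tape(int maj)
{
	if (maj < 0 || (size_t)maj >= SK_NBLKDEV)
		return false;
	return sk_bdevsw[maj].d_type == SK_TAPE;
}
```

Checking the range before the array access is the point of the sketch: the
lookup is simply skipped for values that cannot index the table.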


# 1.51 08-Jul-1997 pk

In bread() and breadn(): if getblk() returns a DELWRI buffer, don't
call biowait() but return `success' immediately. We can return `success'
because buffers with recorded errors are not returned by getblk().
(Takes care of PR#3694).


# 1.50 09-Apr-1997 mycroft

Fix two performance issues:
* When a delayed write buffer falls off the LRU queue, arrange for it to go on
the AGE queue after being flushed out to disk.
* When a delayed write buffer is synced, leave it in its relative position in
the LRU queue.


Revision tags: is-newarp-before-merge is-newarp-base
# 1.49 15-Oct-1996 cgd

curproc was being used directly for ru_{in,ou}block counting. Instead
of using it directly, use a local, and set that local to be curproc
if curproc is not NULL else a pointer to process 0's proc struct.
If syncing disks while handling a panic that occurred while 'curproc'
was NULL, the old code would dereference NULL and die.
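
A minimal sketch of the fallback described above, assuming a reduced
hypothetical process structure (sk_proc) and a static stand-in for process
0; the real kernel structures and the actual change differ:

```c
#include <stddef.h>

/* Hypothetical, reduced process record holding only the I/O counters. */
struct sk_proc {
	long ru_inblock;
	long ru_oublock;
};

/* Stand-in for process 0's proc struct. */
static struct sk_proc sk_proc0;

/* Pick the process to charge: the current one, or process 0 if none. */
static struct sk_proc *
sk_accounting_proc(struct sk_proc *cur)
{
	return cur != NULL ? cur : &sk_proc0;
}

/* Charge one block read through a local pointer, never through cur directly. */
static void
sk_charge_read(struct sk_proc *cur)
{
	struct sk_proc *p = sk_accounting_proc(cur);

	p->ru_inblock++;
}
```

Resolving the pointer once into a local keeps every later counter update
safe even when no current process exists, as during a panic-time sync.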


# 1.48 13-Oct-1996 christos

backout previous kprintf change


# 1.47 10-Oct-1996 christos

printf -> kprintf, sprintf -> ksprintf


# 1.46 18-Jun-1996 mycroft

In the sync case of bwrite(), move the accounting earlier so that the
delayed write is logically converted to a sync write, mirroring the async case.
In bdwrite(), move the tape case earlier to avoid needless reassignbuf()s.


# 1.45 17-Jun-1996 pk

Call reassignbuf() at splbio in bdwrite().


# 1.44 11-Jun-1996 pk

Protect vnode when updating for started IO on buffers.


Revision tags: netbsd-1-2-base
# 1.43 22-Apr-1996 christos

branches: 1.43.4;
remove include of <sys/cpu.h>


# 1.42 18-Feb-1996 fvdl

Changes for NFSv3 code: pull in more NFS include files into kern_time.c
to get types right (overkill for just one function call, but oh well).
Clear B_NEEDCOMMIT in bdwrite().


# 1.41 09-Feb-1996 christos

More proto fixes


# 1.40 04-Feb-1996 christos

First pass at prototyping


Revision tags: netbsd-1-1-PATCH001 netbsd-1-1-RELEASE netbsd-1-1-base
# 1.39 02-Aug-1995 cgd

fix bug pointed out by, and do the cleanup suggested by
Alasdair Baird <alasdair@wildcat.demon.co.uk>. From pr 1301.


# 1.38 12-Jul-1995 cgd

bdwrite() should upgrade writes to tape devices by sending them to
bawrite(). it's logically more correct (doesn't return an error code,
because it's async; bdwrite is also async), it still writes things
in-order, it makes sure the proper accounting is done (see the
wasdelayed cases in bwrite()), and it allows writes to vnodes on volumes
mounted with MNT_ASYNC to be converted into delayed writes the way
God, err, Kirk intended. Convert synchronous bwrite()s on MNT_ASYNC
file systems to delayed writes.


# 1.37 12-Jul-1995 cgd

fix long-standing XXX in getblk(): NFS does funky things (somewhat
explained in comments), which can cause a race condition. amazingly,
the _only_ time i've ever seen or heard of this problem was in some
comments and sources by Rick Macklem, and when running against
a DEC OSF/1 NFS server running on an Alpha.


# 1.36 20-Jun-1995 cgd

fix pr 1128; change vfs_bufstats defn from DIAGNOSTIC -> DEBUG


# 1.35 10-Apr-1995 mycroft

Use the new d_type field.


# 1.34 22-Nov-1994 mycroft

Various code rearrangement.


# 1.33 30-Oct-1994 cgd

be more careful with types, also pull in headers where necessary.


# 1.32 29-Aug-1994 mycroft

Patch to fix `reassignbuf: NULL' messages, from cgd.


Revision tags: netbsd-1-0-base
# 1.31 03-Jul-1994 cgd

branches: 1.31.2;
light clean up, use some macros


# 1.30 29-Jun-1994 cgd

New RCS ID's, take two. they're more aesthetically pleasant, and use 'NetBSD'


# 1.29 14-Jun-1994 mycroft

Minor cleanup.


# 1.28 08-Jun-1994 mycroft

Update to 4.4-Lite fs code.


# 1.27 05-Jun-1994 cgd

minor typo pointed out by Onno van der Linden


# 1.26 29-May-1994 mycroft

Clear more flags in brelse().


# 1.25 27-May-1994 mycroft

Cluster routines want 0-sized bufs.


# 1.24 19-May-1994 cgd

some paranoia, also, clean up spaces vs. tabs


# 1.23 18-May-1994 cgd

my two favorite reference books


# 1.22 18-May-1994 cgd

forgot the damned rcsid


# 1.21 18-May-1994 cgd

significant rework, to match 4.4-Lite interfaces, and to comment more
closely from Bach.


# 1.20 17-May-1994 cgd

notdef out vn_bwrite, for now, so that kernels compile, until new fs stuff


# 1.19 17-May-1994 mycroft

Implement new functions for 4.4-Lite file systems, and some general cleanup.


# 1.18 28-Apr-1994 cgd

rearrange some splfoo


# 1.17 26-Apr-1994 cgd

clean up a little bit, and minor optimization...


# 1.16 21-Apr-1994 cgd

Convert mount, vnode, and buf structs to use <sys/queue.h>. Also,
some knf and structure frobbing to do along with it.


# 1.15 09-Mar-1994 ws

Make FFS optional


# 1.14 27-Jan-1994 cgd

get rid of jolitz hack, and add panic() where appropriate


# 1.13 18-Dec-1993 mycroft

Canonicalize all #includes.


# 1.12 12-Nov-1993 cgd

new specfs.h and fifo.h locations


# 1.11 26-Oct-1993 cgd

if you try to allocate a buffer larger than MAXBSIZE, panic.


# 1.10 19-Oct-1993 cgd

fix my last change; for some reason i thought that 'p' was defined
in these functions. use curproc instead.


# 1.9 19-Oct-1993 cgd

pay for block i/o. slightly different than how done by Mark Tinguely.


# 1.8 06-Oct-1993 cgd

changed the Debugger() call, which not all kernels have, to panic(),
but only when DIAGNOSTIC is defined.


# 1.7 04-Oct-1993 cgd

new, improved, and rationally-implemented vfs_bio. no more serious
structural changes should happen, as it now does the right thing
w.r.t. buffer resizing and having lots of buffers vs. relatively
little buffer space. Ports can now "do the standard thing", re:
nbuf and bufpages, which is make nbuf = bufpages by default.


# 1.6 01-Oct-1993 cgd

patch from Christoph Badura <bad@flatlin.ka.sub.org> to fix credential
use by read-ahead blocks. This fixes those weak NFS authentication
messages, and allows us to use BSDI NFS servers again...


# 1.5 29-Sep-1993 cgd

convert to use the buffers which are (now) statically allocated at
startup in machdep.c... buffers are now *never* allocated after boot.
currently, the limitation that says bufpages must cover nbuf*MAXBSIZE
still exists, but this is one step closer to removing that limitation.


Revision tags: magnum-base
# 1.4 21-Sep-1993 cgd

rewrite biodone to the spec in the daemon book, and to account for
the fact that buffers with B_CALL set shouldn't be brelse()'d.


# 1.3 07-Aug-1993 cgd

branches: 1.3.2;
merge in changes from netbsd-0-9-ALPHA2


# 1.2 24-Jul-1993 ws

Use all of freebufspace


Revision tags: netbsd-0-9-ALPHA netbsd-0-9-base
# 1.1 19-Jul-1993 cgd

branches: 1.1.1; 1.1.2;
replace jolitz's vfs__bio with a better one from CMU via mw.
so, replace vfs__bio, and deal with attendant changes.