History log of /freebsd-current/sys/sys/bio.h
Revision Date Author Comments
# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 2ff63af9 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .h pattern

Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/


# 965e9195 24-Apr-2022 Warner Losh <imp@FreeBSD.org>

bio: Add the speedup flags to PRINT_BIO_FLAGS

The BIO_SPEEDUP_WRITE and BIO_SPEEDUP_TRIM bits are part of the flags
word, so print them as such.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35035


# 077ab5bd 23-Feb-2022 Warner Losh <imp@FreeBSD.org>

bio: make _bio_cflags match bio_cflags

While none of the in-tree geoms that modify the bio_cflags use the top
half of the word, were they to do this in the future, we'd hit false
positives for DIAGNOSTIC kernels. Make both uint16_t.

MFC After: 1 week
Sponsored by: Netflix


# 8587d752 16-Oct-2021 Wuyang Chung <wy-chung@outlook.com>

Correct the name of the second parameter of biowait to wmesg

This parameter is passed directly to msleep, and the name of the msleep
parameter is wmesg. Make them match.

Pull Request: https://github.com/freebsd/freebsd-src/pull/557


# c6213bef 28-Sep-2021 Gleb Smirnoff <glebius@FreeBSD.org>

Add flag BIO_SWAP to mark IOs that are associated with swap.

Submitted by: jtl
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D24400


# 86c06ff8 29-Dec-2019 Alexander Motin <mav@FreeBSD.org>

Remove GEOM_SCHED class and gsched tool.

This code was not actively maintained since it was introduced 10 years ago.
It lacks support for many later GEOM features, such as direct dispatch,
unmapped I/O, stripesize/stripeoffset, resize, etc. Plus it is the only
remaining use of GEOM nstart/nend request counters, used there to implement
live insertion/removal, questionable by itself. Plus, as number of people
commented, GEOM is not the best place for I/O scheduler, since it has
limited information about layers both above and below it, required for
efficient scheduling. Plus with the modern shift to SSDs there is just no
more significant need for this kind of scheduling.

Approved by: imp, phk, luigi
Relnotes: yes


# b182c792 16-Dec-2019 Warner Losh <imp@FreeBSD.org>

Add BIO_SPEEDUP

Add BIO_SPEEDUP bio command and g_io_speedup wrapper. It tells the
lower layers that the upper layers are dealing with some shortage
(dirty pages and/or disk blocks). The lower layers should do what they
can to speed up anything that's been delayed.

The first use will be to tell the CAM I/O scheduler that any TRIM
shaping should be short-circuited because the system needs
blocks. We'll also call it when there's too many resources used by
UFS.

Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351


# dab83bd1 25-Jan-2019 Kirk McKusick <mckusick@FreeBSD.org>

Add printing of b_ioflags to DDB `show buffer' command.

Sponsored by: Netflix


# a971acbc 13-Jun-2018 Warner Losh <imp@FreeBSD.org>

Implement a 'car limit' for bioq.

Allow one to implement a 'car limit' for
bioq_disksort. debug.bioq_batchsize sets the size of car limit. Every
time we queue that many requests, we start over so that we limit the
latency for requests when the software queue depths are large. A value
of '0', the default, means to revert to the old behavior.

Sponsored by: Netflix


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 8532d381 31-Oct-2016 Conrad Meyer <cem@FreeBSD.org>

Add BUF_TRACKING and FULL_BUF_TRACKING buffer debugging

Upstream the BUF_TRACKING and FULL_BUF_TRACKING buffer debugging code.
This can be handy in tracking down what code touched hung bios and bufs
last. The full history is especially useful, but adds enough bloat that
it shouldn't be enabled in release builds.

Function names (or arbitrary string constants) are tracked in a
fixed-size ring in bufs. Bios gain a pointer to the upper buf for
tracking. SCSI CCBs gain a pointer to the upper bio for tracking.

Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8366


# bbdd6614 19-Sep-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused bio_taskqueue().

MFC after: 1 month


# 9a6844d5 19-May-2016 Kenneth D. Merry <ken@FreeBSD.org>

Add support for managing Shingled Magnetic Recording (SMR) drives.

This change includes support for SCSI SMR drives (which conform to the
Zoned Block Commands or ZBC spec) and ATA SMR drives (which conform to
the Zoned ATA Command Set or ZAC spec) behind SAS expanders.

This includes full management support through the GEOM BIO interface, and
through a new userland utility, zonectl(8), and through camcontrol(8).

This is now ready for filesystems to use to detect and manage zoned drives.
(There is no work in progress that I know of to use this for ZFS or UFS, if
anyone is interested, let me know and I may have some suggestions.)

Also, improve ATA command passthrough and dispatch support, both via ATA
and ATA passthrough over SCSI.

Also, add support to camcontrol(8) for the ATA Extended Power Conditions
feature set. You can now manage ATA device power states, and set various
idle time thresholds for a drive to enter lower power states.

Note that this change cannot be MFCed in full, because it depends on
changes to the struct bio API that break compatilibity. In order to
avoid breaking the stable API, only changes that don't touch or depend on
the struct bio changes can be merged. For example, the camcontrol(8)
changes don't depend on the new bio API, but zonectl(8) and the probe
changes to the da(4) and ada(4) drivers do depend on it.

Also note that the SMR changes have not yet been tested with an actual
SCSI ZBC device, or a SCSI to ATA translation layer (SAT) that supports
ZBC to ZAC translation. I have not yet gotten a suitable drive or SAT
layer, so any testing help would be appreciated. These changes have been
tested with Seagate Host Aware SATA drives attached to both SAS and SATA
controllers. Also, I do not have any SATA Host Managed devices, and I
suspect that it may take additional (hopefully minor) changes to support
them.

Thanks to Seagate for supplying the test hardware and answering questions.

sbin/camcontrol/Makefile:
Add epc.c and zone.c.

sbin/camcontrol/camcontrol.8:
Document the zone and epc subcommands.

sbin/camcontrol/camcontrol.c:
Add the zone and epc subcommands.

Add auxiliary register support to build_ata_cmd(). Make sure to
set the CAM_ATAIO_NEEDRESULT, CAM_ATAIO_DMA, and CAM_ATAIO_FPDMA
flags as appropriate for ATA commands.

Add a new get_ata_status() function to parse ATA result from SCSI
sense descriptors (for ATA passthrough over SCSI) and ATA I/O
requests.

sbin/camcontrol/camcontrol.h:
Update the build_ata_cmd() prototype

Add get_ata_status(), zone(), and epc().

sbin/camcontrol/epc.c:
Support for ATA Extended Power Conditions features. This includes
support for all features documented in the ACS-4 Revision 12
specification from t13.org (dated February 18, 2016).

The EPC feature set allows putting a drive into a power power mode
immediately, or setting timeouts so that the drive will
automatically enter progressively lower power states after various
idle times.

sbin/camcontrol/fwdownload.c:
Update the firmware download code for the new build_ata_cmd()
arguments.

sbin/camcontrol/zone.c:
Implement support for Shingled Magnetic Recording (SMR) drives
via SCSI Zoned Block Commands (ZBC) and ATA Zoned Device ATA
Command Set (ZAC).

These specs were developed in concert, and are functionally
identical. The primary differences are due to SCSI and ATA
differences. (SCSI is big endian, ATA is little endian, for
example.)

This includes support for all commands defined in the ZBC and
ZAC specs.

sys/cam/ata/ata_all.c:
Decode a number of additional ATA command names in ata_op_string().

Add a new CCB building function, ata_read_log().

Add ata_zac_mgmt_in() and ata_zac_mgmt_out() CCB building
functions. These support both DMA and NCQ encapsulation.

sys/cam/ata/ata_all.h:
Add prototypes for ata_read_log(), ata_zac_mgmt_out(), and
ata_zac_mgmt_in().

sys/cam/ata/ata_da.c:
Revamp the ada(4) driver to support zoned devices.

Add four new probe states to gather information needed for zone
support.

Add a new adasetflags() function to avoid duplication of large
blocks of flag setting between the async handler and register
functions.

Add new sysctl variables that describe zone support and paramters.

Add support for the new BIO_ZONE bio, and all of its subcommands:
DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP,
DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS.

sys/cam/scsi/scsi_all.c:
Add command descriptions for the ZBC IN/OUT commands.

Add descriptions for ZBC Host Managed devices.

Add a new function, scsi_ata_pass() to do ATA passthrough over
SCSI. This will eventually replace scsi_ata_pass_16() -- it
can create the 12, 16, and 32-byte variants of the ATA
PASS-THROUGH command, and supports setting all of the
registers defined as of SAT-4, Revision 5 (March 11, 2016).

Change scsi_ata_identify() to use scsi_ata_pass() instead of
scsi_ata_pass_16().

Add a new scsi_ata_read_log() function to facilitate reading
ATA logs via SCSI.

sys/cam/scsi/scsi_all.h:
Add the new ATA PASS-THROUGH(32) command CDB. Add extended and
variable CDB opcodes.

Add Zoned Block Device Characteristics VPD page.

Add ATA Return SCSI sense descriptor.

Add prototypes for scsi_ata_read_log() and scsi_ata_pass().

sys/cam/scsi/scsi_da.c:
Revamp the da(4) driver to support zoned devices.

Add five new probe states, four of which are needed for ATA
devices.

Add five new sysctl variables that describe zone support and
parameters.

The da(4) driver supports SCSI ZBC devices, as well as ATA ZAC
devices when they are attached via a SCSI to ATA Translation (SAT)
layer. Since ZBC -> ZAC translation is a new feature in the T10
SAT-4 spec, most SATA drives will be supported via ATA commands
sent via the SCSI ATA PASS-THROUGH command. The da(4) driver will
prefer the ZBC interface, if it is available, for performance
reasons, but will use the ATA PASS-THROUGH interface to the ZAC
command set if the SAT layer doesn't support translation yet.
As I mentioned above, ZBC command support is untested.

Add support for the new BIO_ZONE bio, and all of its subcommands:
DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP,
DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS.

Add scsi_zbc_in() and scsi_zbc_out() CCB building functions.

Add scsi_ata_zac_mgmt_out() and scsi_ata_zac_mgmt_in() CCB/CDB
building functions. Note that these have return values, unlike
almost all other CCB building functions in CAM. The reason is
that they can fail, depending upon the particular combination
of input parameters. The primary failure case is if the user
wants NCQ, but fails to specify additional CDB storage. NCQ
requires using the 32-byte version of the SCSI ATA PASS-THROUGH
command, and the current CAM CDB size is 16 bytes.

sys/cam/scsi/scsi_da.h:
Add ZBC IN and ZBC OUT CDBs and opcodes.

Add SCSI Report Zones data structures.

Add scsi_zbc_in(), scsi_zbc_out(), scsi_ata_zac_mgmt_out(), and
scsi_ata_zac_mgmt_in() prototypes.

sys/dev/ahci/ahci.c:
Fix SEND / RECEIVE FPDMA QUEUED in the ahci(4) driver.

ahci_setup_fis() previously set the top bits of the sector count
register in the FIS to 0 for FPDMA commands. This is okay for
read and write, because the PRIO field is in the only thing in
those bits, and we don't implement that further up the stack.

But, for SEND and RECEIVE FPDMA QUEUED, the subcommand is in that
byte, so it needs to be transmitted to the drive.

In ahci_setup_fis(), always set the the top 8 bits of the
sector count register. We need it in both the standard
and NCQ / FPDMA cases.

sys/geom/eli/g_eli.c:
Pass BIO_ZONE commands through the GELI class.

sys/geom/geom.h:
Add g_io_zonecmd() prototype.

sys/geom/geom_dev.c:
Add new DIOCZONECMD ioctl, which allows sending zone commands to
disks.

sys/geom/geom_disk.c:
Add support for BIO_ZONE commands.

sys/geom/geom_disk.h:
Add a new flag, DISKFLAG_CANZONE, that indicates that a given
GEOM disk client can handle BIO_ZONE commands.

sys/geom/geom_io.c:
Add a new function, g_io_zonecmd(), that handles execution of
BIO_ZONE commands.

Add permissions check for BIO_ZONE commands.

Add command decoding for BIO_ZONE commands.

sys/geom/geom_subr.c:
Add DDB command decoding for BIO_ZONE commands.

sys/kern/subr_devstat.c:
Record statistics for REPORT ZONES commands. Note that the
number of bytes transferred for REPORT ZONES won't quite match
what is received from the harware. This is because we're
necessarily counting bytes coming from the da(4) / ada(4) drivers,
which are using the disk_zone.h interface to communicate up
the stack. The structure sizes it uses are slightly different
than the SCSI and ATA structure sizes.

sys/sys/ata.h:
Add many bit and structure definitions for ZAC, NCQ, and EPC
command support.

sys/sys/bio.h:
Convert the bio_cmd field to a straight enumeration. This will
yield more space for additional commands in the future. After
change r297955 and other related changes, this is now possible.
Converting to an enumeration will also prevent use as a bitmask
in the future.

sys/sys/disk.h:
Define the DIOCZONECMD ioctl.

sys/sys/disk_zone.h:
Add a new API for managing zoned disks. This is very close to
the SCSI ZBC and ATA ZAC standards, but uses integers in native
byte order instead of big endian (SCSI) or little endian (ATA)
byte arrays.

This is intended to offer to the complete feature set of the ZBC
and ZAC disk management without requiring the application developer
to include SCSI or ATA headers. We also use one set of headers
for ioctl consumers and kernel bio-level consumers.

sys/sys/param.h:
Bump __FreeBSD_version for sys/bio.h command changes, and inclusion
of SMR support.

usr.sbin/Makefile:
Add the zonectl utility.

usr.sbin/diskinfo/diskinfo.c
Add disk zoning capability to the 'diskinfo -v' output.

usr.sbin/zonectl/Makefile:
Add zonectl makefile.

usr.sbin/zonectl/zonectl.8
zonectl(8) man page.

usr.sbin/zonectl/zonectl.c
The zonectl(8) utility. This allows managing SCSI or ATA zoned
disks via the disk_zone.h API. You can report zones, reset write
pointers, get parameters, etc.

Sponsored by: Spectra Logic
Differential Revision: https://reviews.freebsd.org/D6147
Reviewed by: wblock (documentation)


# 9a8fa125 13-Apr-2016 Warner Losh <imp@FreeBSD.org>

Bump bio_cmd and bio_*flags from 8 bits to 16.

Differential Revision: https://reviews.freebsd.org/D5784


# a9934668 03-Dec-2015 Kenneth D. Merry <ken@FreeBSD.org>

Add asynchronous command support to the pass(4) driver, and the new
camdd(8) utility.

CCBs may be queued to the driver via the new CAMIOQUEUE ioctl, and
completed CCBs may be retrieved via the CAMIOGET ioctl. User
processes can use poll(2) or kevent(2) to get notification when
I/O has completed.

While the existing CAMIOCOMMAND blocking ioctl interface only
supports user virtual data pointers in a CCB (generally only
one per CCB), the new CAMIOQUEUE ioctl supports user virtual and
physical address pointers, as well as user virtual and physical
scatter/gather lists. This allows user applications to have more
flexibility in their data handling operations.

Kernel memory for data transferred via the queued interface is
allocated from the zone allocator in MAXPHYS sized chunks, and user
data is copied in and out. This is likely faster than the
vmapbuf()/vunmapbuf() method used by the CAMIOCOMMAND ioctl in
configurations with many processors (there are more TLB shootdowns
caused by the mapping/unmapping operation) but may not be as fast
as running with unmapped I/O.

The new memory handling model for user requests also allows
applications to send CCBs with request sizes that are larger than
MAXPHYS. The pass(4) driver now limits queued requests to the I/O
size listed by the SIM driver in the maxio field in the Path
Inquiry (XPT_PATH_INQ) CCB.

There are some things things would be good to add:

1. Come up with a way to do unmapped I/O on multiple buffers.
Currently the unmapped I/O interface operates on a struct bio,
which includes only one address and length. It would be nice
to be able to send an unmapped scatter/gather list down to
busdma. This would allow eliminating the copy we currently do
for data.

2. Add an ioctl to list currently outstanding CCBs in the various
queues.

3. Add an ioctl to cancel a request, or use the XPT_ABORT CCB to do
that.

4. Test physical address support. Virtual pointers and scatter
gather lists have been tested, but I have not yet tested
physical addresses or scatter/gather lists.

5. Investigate multiple queue support. At the moment there is one
queue of commands per pass(4) device. If multiple processes
open the device, they will submit I/O into the same queue and
get events for the same completions. This is probably the right
model for most applications, but it is something that could be
changed later on.

Also, add a new utility, camdd(8) that uses the asynchronous pass(4)
driver interface.

This utility is intended to be a basic data transfer/copy utility,
a simple benchmark utility, and an example of how to use the
asynchronous pass(4) interface.

It can copy data to and from pass(4) devices using any target queue
depth, starting offset and blocksize for the input and ouptut devices.
It currently only supports SCSI devices, but could be easily extended
to support ATA devices.

It can also copy data to and from regular files, block devices, tape
devices, pipes, stdin, and stdout. It does not support queueing
multiple commands to any of those targets, since it uses the standard
read(2)/write(2)/writev(2)/readv(2) system calls.

The I/O is done by two threads, one for the reader and one for the
writer. The reader thread sends completed read requests to the
writer thread in strictly sequential order, even if they complete
out of order. That could be modified later on for random I/O patterns
or slightly out of order I/O.

camdd(8) uses kqueue(2)/kevent(2) to get I/O completion events from
the pass(4) driver and also to send request notifications internally.

For pass(4) devcies, camdd(8) uses a single buffer (CAM_DATA_VADDR)
per CAM CCB on the reading side, and a scatter/gather list
(CAM_DATA_SG) on the writing side. In addition to testing both
interfaces, this makes any potential reblocking of I/O easier. No
data is copied between the reader and the writer, but rather the
reader's buffers are split into multiple I/O requests or combined
into a single I/O request depending on the input and output blocksize.

For the file I/O path, camdd(8) also uses a single buffer (read(2),
write(2), pread(2) or pwrite(2)) on reads, and a scatter/gather list
(readv(2), writev(2), preadv(2), pwritev(2)) on writes.

Things that would be nice to do for camdd(8) eventually:

1. Add support for I/O pattern generation. Patterns like all
zeros, all ones, LBA-based patterns, random patterns, etc. Right
Now you can always use /dev/zero, /dev/random, etc.

2. Add support for a "sink" mode, so we do only reads with no
writes. Right now, you can use /dev/null.

3. Add support for automatic queue depth probing, so that we can
figure out the right queue depth on the input and output side
for maximum throughput. At the moment it defaults to 6.

4. Add support for SATA device passthrough I/O.

5. Add support for random LBAs and/or lengths on the input and
output sides.

6. Track average per-I/O latency and busy time. The busy time
and latency could also feed in to the automatic queue depth
determination.

sys/cam/scsi/scsi_pass.h:
Define two new ioctls, CAMIOQUEUE and CAMIOGET, that queue
and fetch asynchronous CAM CCBs respectively.

Although these ioctls do not have a declared argument, they
both take a union ccb pointer. If we declare a size here,
the ioctl code in sys/kern/sys_generic.c will malloc and free
a buffer for either the CCB or the CCB pointer (depending on
how it is declared). Since we have to keep a copy of the
CCB (which is fairly large) anyway, having the ioctl malloc
and free a CCB for each call is wasteful.

sys/cam/scsi/scsi_pass.c:
Add asynchronous CCB support.

Add two new ioctls, CAMIOQUEUE and CAMIOGET.

CAMIOQUEUE adds a CCB to the incoming queue. The CCB is
executed immediately (and moved to the active queue) if it
is an immediate CCB, but otherwise it will be executed
in passstart() when a CCB is available from the transport layer.

When CCBs are completed (because they are immediate or
passdone() if they are queued), they are put on the done
queue.

If we get the final close on the device before all pending
I/O is complete, all active I/O is moved to the abandoned
queue and we increment the peripheral reference count so
that the peripheral driver instance doesn't go away before
all pending I/O is done.

The new passcreatezone() function is called on the first
call to the CAMIOQUEUE ioctl on a given device to allocate
the UMA zones for I/O requests and S/G list buffers. This
may be good to move off to a taskqueue at some point.
The new passmemsetup() function allocates memory and
scatter/gather lists to hold the user's data, and copies
in any data that needs to be written. For virtual pointers
(CAM_DATA_VADDR), the kernel buffer is malloced from the
new pass(4) driver malloc bucket. For virtual
scatter/gather lists (CAM_DATA_SG), buffers are allocated
from a new per-pass(9) UMA zone in MAXPHYS-sized chunks.
Physical pointers are passed in unchanged. We have support
for up to 16 scatter/gather segments (for the user and
kernel S/G lists) in the default struct pass_io_req, so
requests with longer S/G lists require an extra kernel malloc.

The new passcopysglist() function copies a user scatter/gather
list to a kernel scatter/gather list. The number of elements
in each list may be different, but (obviously) the amount of data
stored has to be identical.

The new passmemdone() function copies data out for the
CAM_DATA_VADDR and CAM_DATA_SG cases.

The new passiocleanup() function restores data pointers in
user CCBs and frees memory.

Add new functions to support kqueue(2)/kevent(2):

passreadfilt() tells kevent whether or not the done
queue is empty.

passkqfilter() adds a knote to our list.

passreadfiltdetach() removes a knote from our list.

Add a new function, passpoll(), for poll(2)/select(2)
to use.

Add devstat(9) support for the queued CCB path.

sys/cam/ata/ata_da.c:
Add support for the BIO_VLIST bio type.

sys/cam/cam_ccb.h:
Add a new enumeration for the xflags field in the CCB header.
(This doesn't change the CCB header, just adds an enumeration to
use.)

sys/cam/cam_xpt.c:
Add a new function, xpt_setup_ccb_flags(), that allows specifying
CCB flags.

sys/cam/cam_xpt.h:
Add a prototype for xpt_setup_ccb_flags().

sys/cam/scsi/scsi_da.c:
Add support for BIO_VLIST.

sys/dev/md/md.c:
Add BIO_VLIST support to md(4).

sys/geom/geom_disk.c:
Add BIO_VLIST support to the GEOM disk class. Re-factor the I/O size
limiting code in g_disk_start() a bit.

sys/kern/subr_bus_dma.c:
Change _bus_dmamap_load_vlist() to take a starting offset and
length.

Add a new function, _bus_dmamap_load_pages(), that will load a list
of physical pages starting at an offset.

Update _bus_dmamap_load_bio() to allow loading BIO_VLIST bios.
Allow unmapped I/O to start at an offset.

sys/kern/subr_uio.c:
Add two new functions, physcopyin_vlist() and physcopyout_vlist().

sys/pc98/include/bus.h:
Guard kernel-only parts of the pc98 machine/bus.h header with
#ifdef _KERNEL.

This allows userland programs to include <machine/bus.h> to get the
definition of bus_addr_t and bus_size_t.

sys/sys/bio.h:
Add a new bio flag, BIO_VLIST.

sys/sys/uio.h:
Add prototypes for physcopyin_vlist() and physcopyout_vlist().

share/man/man4/pass.4:
Document the CAMIOQUEUE and CAMIOGET ioctls.

usr.sbin/Makefile:
Add camdd.

usr.sbin/camdd/Makefile:
Add a makefile for camdd(8).

usr.sbin/camdd/camdd.8:
Man page for camdd(8).

usr.sbin/camdd/camdd.c:
The new camdd(8) utility.

Sponsored by: Spectra Logic
MFC after: 1 week


# ef04b888 23-Mar-2013 Will Andrews <will@FreeBSD.org>

Be more explicit about what each bio_cmd & bio_flags value means.

Reviewed by: ken (mentor)


# ee75e7de 19-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# f03f7a0c 02-Sep-2010 Justin T. Gibbs <gibbs@FreeBSD.org>

Correct bioq_disksort so that bioq_insert_tail() offers barrier semantic.
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.

The barrier semantics of bioq_insert_tail() were broken in two ways:

o In bioq_disksort(), an added bio could be inserted at the head of
the queue, even when a barrier was present, if the sort key for
the new entry was less than that of the last queued barrier bio.

o The last_offset used to generate the sort key for newly queued bios
did not stay at the position of the barrier until either the
barrier was de-queued, or a new barrier (which updates last_offset)
was queued. When a barrier is in effect, we know that the disk
will pass through the barrier position just before the
"blocked bios" are released, so using the barrier's offset for
last_offset is the optimal choice.

sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
o Update last_offset in bioq_insert_tail().

o Only update last_offset in bioq_remove() if the removed bio is
at the head of the queue (typically due to a call via
bioq_takefirst()) and no barrier is active.

o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
set prev to the barrier and cur to it's next element. Now that
last_offset is kept at the barrier position, this change isn't
strictly necessary, but since we have to take a decision branch
anyway, it does avoid one, no-op, loop iteration in the while
loop that immediately follows.

o In bioq_disksort(), bypass the normal sort for bios with the
BIO_ORDERED attribute and instead insert them into the queue
with bioq_insert_tail(). bioq_insert_tail() not only gives
the desired command order during insertion, but also provides
barrier semantics so that commands disksorted in the future
cannot pass the just enqueued transaction.

sys/sys/bio.h:
Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.

sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c
Use an ordered command for SCSI/ATA-NCQ commands issued in
response to bios with the BIO_ORDERED flag set.

sys/cam/scsi/scsi_da.c
Use an ordered tag when issuing a synchronize cache command.

Wrap some lines to 80 columns.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
sys/geom/geom_io.c
Mark bios with the BIO_FLUSH command as BIO_ORDERED.

Sponsored by: Spectra Logic Corporation
MFC after: 1 month


# b16d0f5d 15-Dec-2009 Luigi Rizzo <luigi@FreeBSD.org>

MFC: expose only bio_cmd and bio_flags values to userland


# 37e20d0a 11-Dec-2009 Luigi Rizzo <luigi@FreeBSD.org>

only export bio_cmd and flags to userland (bio_cmd are
used by ggatectl, flags are potentially useful).
Other parts are internal kernel data structures and should
not be visible to userland.

No API change involved.

MFC after: 3 days


# 6231f75b 11-Jun-2009 Luigi Rizzo <luigi@FreeBSD.org>

As discussed in the devsummit, introduce two fields in the
struct bio to store classification information, and a hook
for classifier functions that can be called by g_io_request().

This code is from Fabio Checconi as part of his GSOC work.


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# c3618c65 31-Oct-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add a new I/O request - BIO_FLUSH, which basically tells providers below to
flush their caches. For now will mostly be used by disks to flush their
write cache.

Sponsored by: home.pl


# 92ee312d 01-Mar-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Assert proper use of bio_caller1, bio_caller2, bio_cflags, bio_driver1,
bio_driver2 and bio_pflags fields.

Reviewed by: phk


# 12f6e63d 23-Nov-2005 Kris Kennaway <kris@FreeBSD.org>

Correct division by zero error in comment.


# 2afa4593 12-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- switch_point is now unused. This doesn't break module binary compatability
since the structure is shrinking, not growing.


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# bf484316 12-Dec-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add bioq_insert_head() function.

OK'd by: phk


# dcbd0fe5 30-Aug-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add more KASSERTS and checks.


# d298f919 19-Aug-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add bioq_takefirst().

If the bioq is empty, NULL is returned. Otherwise the front element
is removed and returned.

This can simplify locking in many drivers from:

lock()
bp = bioq_first(bq);
if (bp == NULL) {
unlock()
return
}
bioq_remove(bp, bq)
unlock
to:
lock()
bp = bioq_takefirst(bq);
unlock()
if (bp == NULL)
return;


# 51385a3c 04-Aug-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Add two fields to bio structure: 'bio_cflags' which can be used by
consumer and 'bio_pflags' which can be used by provider.
- Remove BIO_FLAG1 and BIO_FLAG2 flags. From now on new fields should be
used for internal flags.
- Update g_bio(9) manual page.
- Update some comments.
- Update GEOM_MIRROR, which was the only one using BIO_FLAGs.

Idea from: phk
Reviewed by: phk


# 89c9c53d 16-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 82c6e879 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 5fcf4e43 28-Jan-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Bring back the geom_bioqueues, they _are_ a good idea.

ATA will uses these RSN.


# 3916828f 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Retire bio_blkno entirely.

bio_offset is the field drivers should use.
bio_pblkno remains as a convenient place to store the number of
the device drivers.


# 4cb4df48 18-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Make bioq_disksort() sort on the bio_offset field instead of bio_pblkno.


# 5f5a9022 12-Apr-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Retire the experimental bio_taskqueue(), it was not quite as usable as
hoped. It can be revived from here, should other drivers be able to
use it.


# b0fc6220 03-Apr-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove BIO_SETATTR from non-GEOM part of kernel as well.


# 332a2bf0 01-Apr-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the #define for bioqdisksort(), it's no longer needed.


# af6ca7f4 31-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce bioq_flush() function.


# d2a0822e 30-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

retire the "busy" field in bioqueues, it's served it's purpose.


# d086f85a 30-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Preparation commit before I start on the bioqueue lockdown:

Collect all the bits of bioqueue handing in subr_disk.c, vfs_bio.c is big
enough as it is and disksort already lives in subr_disk.c.


# f0e185d7 11-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Implement a bio-taskqueue to reduce number of context switches in
disk I/O processing.

The intent is that the disk driver in its hardware interrupt
routine will simply schedule the bio on the task queue with
a routine to finish off whatever needs done.

The g_up thread will then schedule this routine, the likely
outcome of which is a biodone() which queues the bio on
g_up's regular queue where it will be picked up and processed.

Compared to the using the regular taskqueue, this saves one
contextswitch.

Change our scheduling of the g_up and g_down queues to be water-tight,
at the cost of breaking the userland regression test-shims.

Input and ideas from: scottl


# 801bb689 07-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Commit the correct copy of the g_stat structure.

Add debug.sizeof.g_stat sysctl.

Set the id field of the g_stat when we create consumers and providers.

Remove biocount from consumer, we will use the counters in the g_stat
structure instead. Replace one field which will need to be atomically
manipulated with two fields which will not (stat.nop and stat.nend).

Change add companion field to bio_children: bio_inbed for the exact
same reason.

Don't output the biocount in the confdot output.

Fix KASSERT in g_io_request().

Add sysctl kern.geom.collectstats defaulting to off.

Collect the following raw statistics conditioned on this sysctl:

for each consumer and provider {
total number of operations started.
total number of operations completed.
time last operation completed.
sum of idle-time.
for each of BIO_READ, BIO_WRITE and BIO_DELETE {
number of operations completed.
number of bytes completed.
number of ENOMEM errors.
number of other errors.
sum of transaction time.
}
}

API for getting hold of these statistics data not included yet.


# 936cc461 07-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Rename bio_linkage to the more obvious bio_parent.
Add bio_t0 timestamp, and include <sys/time.h> where needed


# ffa91881 02-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a bio_disk pointer for use between geom_disk and the device drivers.


# c57886c5 30-Jan-2003 Poul-Henning Kamp <phk@FreeBSD.org>

NO_GEOM cleanup: Rip out iodone_chain with a big smile on my face.


# 61e1c1f3 09-Oct-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Add a field for tallying the number of spawned bio's a bio has.

Rearrange and comment some GEOM related fields.

Sponsored by: DARPA & NAI Labs.


# 2382fb0a 20-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make FreeBSD "struct disklabel" agnostic, step 312 of 723:

Rename bioqdisksort() to bioq_disksort().
Keep a #define around to avoid changing all diskdrivers right now.

Move it from subr_disklabel.c to subr_disk.c.
Move prototype from <sys/disklabel.h> to <sys/bio.h>

Sponsored by: DARPA and NAI Labs.


# 06f50eb1 14-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused _bio_buf field. I can't even remember if this ever got
to be used in the first place.

Spotted by: bde


# 028e9e59 14-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Un-inline the non-trivial "trivial" bio* functions.
Untangle devstat_end_transaction_bio()


# c7143e71 13-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Oops, broke the build there. Uninline biodone() now that it is non-trivial.

Introduce biowait() function. Currently there is a race condition and the
mitigation is a timeout/retry. It is not obvious what kind of locking (if any)
is suitable for BIO_DONE, since the majority of users take are of this
themselves, and only a few places actually rely on the wakeup.

Sponsored by: DARPA & NAI Labs.


# 9f6f6b7c 13-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make biodone() default to wakeup() on the struct bio if no bio_done
method was specified.


# f615ab28 05-Sep-2002 Bruce Evans <bde@FreeBSD.org>

Forward declare struct uio so that <sys/uio.h> isn't a prerequisite.

Removed bogus forward declarations of structs.


# 98b0c789 14-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make daddr_t and u_daddr_t 64bits wide.
Retire daddr64_t and use daddr_t instead.

Sponsored by: DARPA & NAI Labs.


# 85b16a0d 09-Apr-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Constifixion of bio_attribute.


# d306122d 26-Mar-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Push BIO_FORMAT into a local hack inside the floppy drivers where
it belongs.


# c58eb46e 23-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs in the removal of __P(()). The main ones were
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.


# 789f12fe 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P


# 0d2af521 15-Mar-2002 Kirk McKusick <mckusick@FreeBSD.org>

Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.


# 23b8e169 11-Mar-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Augment struct bio for GEOM.


# 57c10583 22-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.


# 59374580 29-Jun-2001 Joerg Wunsch <joerg@FreeBSD.org>

Define BIO_CMD{1,2}, available for local hacks, similar to the already
existing BIO_FLAG{1,2}. To be used in the fdc(4) driver soon.


# a468031c 06-May-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Actually biofinish(struct bio *, struct devstat *, int error) is more general
than the bioerror().

Most of this patch is generated by scripts.


# 18ee6cea 06-May-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce bioerror(struct bio*, int err, int complete);


# 9039f19f 14-Jan-2001 Poul-Henning Kamp <phk@FreeBSD.org>

A bit of sanity-checking in bioqdisksort(): panic if we recurse.


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 0b441832 03-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Convert the vm_pager_strategy() interface to take a struct bio instead of
a struct buf. Don't try to examine B_ASYNC, it is a layering violation
to do so. The only current user of this interface is vn(4) which, since
it emulates a disk interface, operates on struct bio already.


# 017ef345 01-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give struct bio it's own call back mechanism.


# 87150cb0 29-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

s/biowait/bufwait/g

Prodded by: several.


# 67f3c95c 25-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Clone the {b|bio}_offset field, and make sure it is always initialized
in struct bio. Eventually, bio_offset will probably obsolete the
bio_blkno and bio_pblkno fields.

Remove the special hack in atapi-cd.c to determine of bio_offset was valid.


# 19583a80 18-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Don't declare common variables in include files:
move buftimelock til vfs_bio.c where it is initialized.


# 8177437d 14-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Complete the bio/buf divorce for all code below devfs::strategy

Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.

CCD not converted yet, casts to struct buf (still safe)

atapi-cd casts to struct buf to examine B_PHYS


# 282ac69e 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Clone bio versions of certain bits of infrastructure:
devstat_end_transaction_bio()
bioq_* versions of bufq_* incl bioqdisksort()
the corresponding "buf" versions will disappear when no longer used.

Move b_offset, b_data and b_bcount to struct bio.

Add BIO_FORMAT as a hack for fd.c etc.

We are now largely ready to start converting drivers to use struct
bio instead of struct buf.


# c244d2de 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# 8c125869 02-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Draw the outline of "struct bio".

Struct bio is the future carrier of I/O requests for "struct buf".


# e4649cfa 01-Apr-2000 Matthew Dillon <dillon@FreeBSD.org>

Change the write-behind code to take more care when starting
async I/O's. The sequential read heuristic has been extended to
cover writes as well. We continue to call cluster_write() normally,
thus blocks in the file will still be reallocated for large (but still
random) I/O's, but I/O will only be initiated for truely sequential
writes.

This solves a number of annoying situations, especially with DBM (hash
method) writes, and also has the side effect of fixing a number of
(stupid) benchmarks.

Reviewed-by: mckusick


# b99c307a 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 21144e3b 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# cf60e8e4 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
was so recently created that its inode has not yet been written to
disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
bitmap is being written to disk. To avoid these stalls, the bitmap is
copied to another buffer which is written thus leaving the original
available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
i_nlink so that inodes that have had no change other than i_effnlink
need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
daemon can choose to skip over them.


# 664a31e4 28-Dec-1999 Peter Wemm <peter@FreeBSD.org>

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.


# 02b00854 21-Dec-1999 Kirk McKusick <mckusick@FreeBSD.org>

Prettyness police: Identify flags in b_xflags with BX_ to distinguish
them from flags in b_flags which are prefixed with B_


# ea94c7b9 11-Dec-1999 Matthew Dillon <dillon@FreeBSD.org>

Synopsis of problem being fixed: Dan Nelson originally reported that
blocks of zeros could wind up in a file written to over NFS by a client.
The problem only occurs a few times per several gigabytes of data. This
problem turned out to be bug #3 below.

bug #1:

B_CLUSTEROK must be cleared when an NFS buffer is reverted from
stage 2 (ready for commit rpc) to stage 1 (ready for write).
Reversions can occur when a dirty NFS buffer is redirtied with new
data.

Otherwise the VFS/BIO system may end up thinking that a stage 1
NFS buffer is clusterable. Stage 1 NFS buffers are not clusterable.

bug #2:

B_CLUSTEROK was inappropriately set for a 'short' NFS buffer (short
buffers only occur near the EOF of the file). Change to only set
when the buffer is a full biosize (usually 8K). This bug has no
effect but should be fixed in -current anyway. It need not be
backported.

bug #3:

B_NEEDCOMMIT was inappropriately set in nfs_flush() (which is
typically only called by the update daemon). nfs_flush()
does a multi-pass loop but due to the lack of vnode locking it
is possible for new buffers to be added to the dirtyblkhd list
while a flush operation is going on. This may result in nfs_flush()
setting B_NEEDCOMMIT on a buffer which has *NOT* yet gone through its
stage 1 write, causing only the commit rpc to be made and thus
causing the contents of the buffer to be thrown away (never sent to
the server).

The patch also contains some cleanup, which only applies to the commit
into -current.

Reviewed by: dg, julian
Originally Reported by: Dan Nelson <dnelson@emsphone.com>


# f8c89187 10-Dec-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the B_BAD buffer flag, it is no longer used.


# 6a54425c 17-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

"b_unused1" was.

Fix comment for b_caller[12] fields.

Spotted by: grog


# 9782fb62 23-Oct-1999 Matthew Dillon <dillon@FreeBSD.org>

Adjust the buffer cache to better handle small-memory machines. A
slightly older version of this code was tested by BDE and I.

Also fixes a lockup situation when kva gets too fragmented.

Remove the maxvmiobufspace variable and sysctl, they are no longer
used. Also cleanup (remove) #if 0 sections from prior commits.

This code is more of a hack, but presumably the whole buffer cache
implementation is going to be rewritten in the next year so it's no
big deal.


# 7179e74f 09-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Give physio a makeover.

- Let physio take read/write compatible args and have it use uio->uio_rw
to determine the direction.

- physread/physwrite are now #defines for physio

- Remove the inversly named minphys(), dev->si_iosize_max takes over.

- Physio() always uses pbufs.

- Fix the check for non page-aligned transfers, now only unaligned
transfers larger than (MAXPHYS - PAGE_SIZE) get fragmented (only
interesting for tapes using max blocksize).

- General wash-and-clean of code.

Constructive input from: bde


# e701df7d 26-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix process p_locks accounting. Conversions of the owner to LK_KERNPROC
caused p_locks to be improperly accounted.

Submitted by: Tor.Egge@fast.no


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 49ff4deb 14-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Spring cleaning around strategy and disklabels/slices:

Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
please see comment in sys/conf.h about the flag argument.

Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.

Remove bogus and unused strategy1 routines.

Remove open/close arguments from dssize(). Pick them up from dev_t.

Remove unused and unfinished setgeom support from diskslice/label/bad144 code.


# 29a751bf 09-Jul-1999 Peter Wemm <peter@FreeBSD.org>

bufhashinit() is called with a caddr_t and is expected to return the
same in both the alpha and i386 ports.


# ad8ac923 08-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations. I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

* buffer cache hash table now dynamically allocated. This will
have no effect on memory consumption for smaller systems and
will help scale the buffer cache for larger systems.

* minor enhancement to pmap_clearbit(). I noticed that
all the calls to it used constant arguments. Making
it an inline allows the constants to propogate to
deeper inlines and should produce better code.

* removal of inherent vfs_ioopt support through the emplacement
of appropriate #ifdef's, with John's permission. If we do not
find a use for it by the end of the year we will remove it entirely.

* removal of getnewbufloops* counters & sysctl's - no longer
necessary for debugging, getnewbuf() is now optimal.

* buffer hash table functions removed from sys/buf.h and localized
to vfs_bio.c

* VFS_BIO_NEED_DIRTYFLUSH flag and support code added
( bwillwrite() ), allowing processes to block when too many dirty
buffers are present in the system.

* removal of a softdep test in bdwrite() that is no longer necessary
now that bdwrite() no longer attempts to flush dirty buffers.

* slight optimization added to bqrelse() - there is no reason
to test for available buffer space on B_DELWRI buffers.

* addition of reverse-scanning code to vfs_bio_awrite().
vfs_bio_awrite() will attempt to locate clusterable areas
in both the forward and reverse direction relative to the
offset of the buffer passed to it. This will probably not
make much of a difference now, but I believe we will start
to rely on it heavily in the future if we decide to shift
some of the burden of the clustering closer to the actual
I/O initiation.

* Removal of the newbufcnt and lastnewbuf counters that Kirk
added. They do not fix any race conditions that haven't already
been fixed by the gbincore() test done after the only call
to getnewbuf(). getnewbuf() is a static, so there is no chance
of it being misused by other modules. ( Unless Kirk can think
of a specific thing that this code fixes. I went through it
very carefully and didn't see anything ).

* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not
think this check is necessary, the buffer should flush properly
whether the vnode is locked or not. ( yes? ).

* removal of extra arguments passed to getnewbuf() that are not
necessary.

* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
vfs_cluster.c

* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
which should greatly aid flushing operations in heavy load
situations - both the pageout and update daemons will be able
to operate more efficiently.

* removal of b_usecount. We may add it back in later but for now
it is useless. Prior implementations of the buffer cache never
had enough buffers for it to be useful, and current implementations
which make more buffers available might not benefit relative to
the amount of sophistication required to implement a b_usecount.
Straight LRU should work just as well, especially when most things
are VMIO backed. I expect that (even though John will not like
this assumption) directories will become VMIO backed some point soon.

Submitted by: Matthew Dillon <dillon@backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# e929c00d 03-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# ddebd879 28-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Hopefully fix the remaining glitches with the BUF_*() changes. This should
(really this time) fix pageout to swap and a couple of clustering cases.

This simplifies BUF_KERNPROC() so that it unconditionally reassigns the
lock owner rather than testing B_ASYNC and having the caller decide when
to do the reassign. At present this is required because some places use
B_CALL/b_iodone to free the buffers without B_ASYNC being set. Also,
vfs_cluster.c explicitly calls BUF_KERNPROC() when attaching the buffers
rather than the parent walking the cluster_head tailq.

Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 79db8f45 27-Jun-1999 Peter Wemm <peter@FreeBSD.org>

The BUF_*() routines must be internally splbio() protected otherwise they
can cause a biodone() from a disk interrupt to spin when the interrupt
code tries to grab the simplelock. Masking BIO here means buftimelock
and/or lk->lk_interlock shouldn't be held when an interrupt tries to grab
them.


# c13d2ba7 27-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Make SMP work again. lockmgr() needed to be told to free the buftimelock
interlock.


# f19fd54a 26-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Quick fix to make libcam compile.. I don't know about the rest of world
yet.


# 67812eac 25-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# c48d1775 07-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce two functions: physread() and physwrite() and use these directly
in *devsw[] rather than the 46 local copies of the same functions.

(grog will do the same for vinum when he has time)


# b0eeea20 06-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


# 84c55b38 06-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused fields from struct buf:
b_savekva
b_validoff
b_validend

Reviewed by: dillon, bde


# 4221e284 02-May-1999 Alan Cox <alc@FreeBSD.org>

The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# 4ef2094e 11-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by: Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)


# eef33ce9 01-Mar-1999 Kirk McKusick <mckusick@FreeBSD.org>

When fsync'ing a file on a filesystem using soft updates, we first try
to write all the dirty blocks. If some of those blocks have dependencies,
they will be remarked dirty when the I/O completes. On systems with
really fast I/O systems, it is possible to get in an infinite loop trying
to flush the buffers, because the I/O finishes before we can get all the
dirty buffers off the v_dirtyblkhd list and into the I/O queue. (The
previous algorithm looped over the v_dirtyblkhd list writing out buffers
until the list emptied.) So, now we mark each buffer that we try to
write so that we can distinguish the ones that are being remarked dirty
from those that we have not yet tried to flush. Once we have tried to
push every buffer once, we then push any associated metadata that is
causing the remaining buffers to be redirtied.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>


# a102abb2 21-Jan-1999 Peter Wemm <peter@FreeBSD.org>

Typo: s/bpreassignbuf/pbreassignbuf/ so the prototype matches it's function


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 1c680b45 12-Nov-1998 David Greenman <dg@FreeBSD.org>

Restored the "reallocblks" code to its former glory. What this does is
basically do a on-the-fly defragmentation of the FFS filesystem, changing
file block allocations to make them contiguous. Thanks to Kirk McKusick
for providing hints on what needed to be done to get this working.


# 630ff663 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Convert the vnode clean/dirty attached buffer lists from LISTs to TAILQs.
Add a new flags field (we get this for free because of struct packing)
for cleaner management of tailq membership.
We had two spare b_flags slots, but they are a precious resource and may
be needed for other things that are related to other b_flags bits. The two
new flags are convenient to use in a seperate location.

Reviewed (in principle) by: dg
Obtained from: John Dyson's old work-in-progress


# 8f359bc6 03-Oct-1998 Bruce Evans <bde@FreeBSD.org>

Quick fix for not being able to sync all the buffers in boot() if
an ext2fs file system is mounted. The soft update changes added
a check for B_DELWRI buffers. This exposed the complete brokenness
of the previous quick fix for failing syncs (PR 3571, committed on
1997/08/04). Use a new buffer flag B_DIRTY and don't abuse B_DELWRI.
B_DIRTY buffers are still written too late, as broken in the previous
fix. This is fairly harmless, because B_DIRTY is only used for
bitmap buffers and fsck.ext2 can fix up the bitmaps perfectly.

Fixed a race in ULCK_BUF() (bremfree() was outside of the splbio()
section).


# 10baba4b 25-Sep-1998 Peter Wemm <peter@FreeBSD.org>

Goodbye BOUNCE_BUFFERS, for a hack it has served us well.

The last consumer of this code (the old SCSI system) has left us and
the CAM code does it's own bouncing. The isa dma system has been
doing it's own bouncing for a while too.

Reviewed by: core


# e266594c 24-Sep-1998 Luoqi Chen <luoqi@FreeBSD.org>

Eliminate a race in VOP_FSYNC() when softupdates is enabled.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Two minor changes are also included,
1. Remove gratuitious checks for error return from vn_lock with LK_RETRY set,
vn_lock should always succeed in these cases.
2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount
a little more unstable. It also keeps us in sync with other BSDs.
Suggested by: Bruce Evans <bde@zeta.org.au>


# eda00cb5 15-Sep-1998 Justin T. Gibbs <gibbs@FreeBSD.org>

When a buffer is removed from a buffer queue, remember it's block number
and use it as "the currently active" buffer in doing disk sort calculations.


# 0375c9f2 05-Sep-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform
device drivers about sectors no longer in use.

Device-drivers receive the call through d_strategy, if they have
D_CANFREE in d_flags.

This allows flash based devices to erase the sectors and avoid
pointlessly carrying them around in compactions.

Reviewed by: Kirk Mckusick, bde
Sponsored by: M-Systems (www.m-sys.com)


# f821426e 24-Aug-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the last remaining evidence of B_TAP.
Reclaim 3 unused bits in b_flags


# 15c73825 14-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Cast pointers to [u]intptr_t instead of to [unsigned] long.


# 3d572901 13-May-1998 Justin T. Gibbs <gibbs@FreeBSD.org>

Fix bogus "cleanup" in bufq_remove. The "switch point" for tqdisksort was
getting mangled.

Submitted by: Bruce Evans <bde@zeta.org.au>


# bd7b49d9 05-May-1998 Justin T. Gibbs <gibbs@FreeBSD.org>

Now that we have a TAILQ_PREV() that returns the previous object, simplify
some of the buf_queue inline functions.


# 08637435 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 0b52b1b3 19-Mar-1998 John Dyson <dyson@FreeBSD.org>

Remove b_generation.


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# b1897c19 08-Mar-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# ab3f7469 02-Dec-1997 Poul-Henning Kamp <phk@FreeBSD.org>

In all such uses of struct buf: 's/b_un.b_addr/b_data/g'


# 19166912 23-Oct-1997 Justin T. Gibbs <gibbs@FreeBSD.org>

Fix a potentially disasterous '==' instead of '=' bug.

Submitted by: John-Mark Gurney <gurney_j@resnet.uoregon.edu>


# 5957b261 21-Sep-1997 Justin T. Gibbs <gibbs@FreeBSD.org>

buf.h:
Change the definition of a buffer queue so that bufqdisksort can
properly deal with bordered writes.

Add inline functions for accessing buffer queues. This should be
considered an opaque data structure by clients.

callout.h:
New callout implementation.

device.h:
Add support for CAM interrupts.

disk.h:
disklabel.h:
tqdisksort->bufqdisksort

kernel.h:
Add new configuration entries for configuration hooks and calling
cpu_rootconf and cpu_dumpconf.

param.h:
Add a priority for sleeping waiting on config hooks.

proc.h:
Update for new callout implementation.

queue.h:
Add TAILQ_HEAD_INITIALIZER from NetBSD.

systm.h:
Add prototypes for cpu_root/dumpconf, splcam, splsoftcam, etc..


# 514ede09 16-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Fixed gratuitous ANSIisms.


# 2d85d0df 07-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Some staticized variables were still declared to be extern.


# 6b195d32 15-Jun-1997 John Dyson <dyson@FreeBSD.org>

Fix a problem with the VN device. Specifically, the VN device can
cause a problem of spiraling death due to buffer resource limitations.
The vfs_bio code in general had little ability to handle buffer resource
management, and now it does. Also, there are a lot more knobs for tuning the
vfs_bio code now. The knobs came free because of the need that there
always be some immediately available buffers (non-delayed or locked) for
use. Note that the buffer cache code is much less likely to get bogged
down with lots of delayed writes, even more so than before.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 8b612c4b 28-Dec-1996 John Dyson <dyson@FreeBSD.org>

This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE: THAT LKMS MUST BE REBUILT!!!


# 09e0c6cc 30-Nov-1996 John Dyson <dyson@FreeBSD.org>

Implement a new totally dynamic (up to MAXPHYS) buffer kva allocation
scheme. Additionally, add the capability for checking for unexpected
kernel page faults. The maximum amount of kva space for buffers hasn't
been decreased from where it is, but it will now be possible to do so.

This scheme manages the kva space similar to the buffers themselves. If
there isn't enough kva space because of usage or fragementation, buffers
will be reclaimed until a buffer allocation is successful. This scheme
should be very resistant to fragmentation problems until/if the LFS code
is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS
is not likely to use such a scheme.

Now there should be NO problem allocating buffers up to MAXPHYS.


# 18acc00c 13-Oct-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Remove some old compatibility names.


# 2eb24859 05-Sep-1996 Justin T. Gibbs <gibbs@FreeBSD.org>

Add B_ORDERED buffer flag and prototype for the bowrite function.

Bowrite guarantees that buffers queued after a call to bowrite will
be written after the specified buffer (to a particular device).
Bowrite does this either by taking advantage of hardware ordering support
(e.g. tagged queueing on SCSI devices) or by resorting to a synchronous write.


# 5903c0ce 03-May-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Remove buf->b_actf, nobody uses it anymore.
Clean up some pmap macro usage.


# 2043dc9a 30-Apr-1996 Bruce Evans <bde@FreeBSD.org>

Removed bogus _BEGIN_DECLS/_END_DECLS.

Removed unused struct tag declarations in cloned code.

Added or cleaned up idempotency ifdefs.


# d4893bd8 07-Apr-1996 Poul-Henning Kamp <phk@FreeBSD.org>

remove b_actb, it's not used anywhere.


# 0f21b147 10-Mar-1996 Jeffrey Hsu <hsu@FreeBSD.org>

Merge in Lite2: rename B_APPENDWRITE to B_NEEDCOMMIT.
The other change to b_pfcent is no longer pertinent, since we've
deleted that field.
Reviewed by: davidg & bde


# 6c5e9bbd 30-Jan-1996 Mike Pritchard <mpp@FreeBSD.org>

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# d079690c 28-Dec-1995 David Greenman <dg@FreeBSD.org>

Made bzero a function vector and added a 586/686 optimized version of
bzero.
Deprecated blkclr (removed it).
Removed some old cruft from cpufunc.h.

The optimized bzero was submitted by Torbjorn Granlund <tege@matematik.su.se>
The kernel adaption and other changes by me.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# fe66bbf4 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Fully prototype physio().


# 68a2196f 19-Nov-1995 John Dyson <dyson@FreeBSD.org>

First set of changes to eliminate the ad-hoc device buffer queues,
replacing them with TAILQ's as appropriate. The SCSI code is the
first to be changed -- until the changes are complete, both b_act and
b_actf will be in the buf structure. b_actf will eventually be removed.


# 5fe17eeb 19-Nov-1995 John Dyson <dyson@FreeBSD.org>

General fixes to the vfs clustring code:

1) Make cluster buffer list be a non-malloced chain. This eliminates
yet another 'evil' M_WAITOK and generally cleans up the code.
2) Fix write clustering for ext2fs. It was just broken. Also, ffs
clustering had an efficiency problem that more bawrites were happening
than should have been.
3) Make changes to buf.h to support the above, plus remove b_pfcent
at the request of David Greenman.
Reviewed by: davidg (partially)


# cbf23505 23-Aug-1995 David Greenman <dg@FreeBSD.org>

Improved BUFHASH algorithm.


# 28f8db14 29-Jul-1995 Bruce Evans <bde@FreeBSD.org>

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuation.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# aba8f38e 19-Apr-1995 David Greenman <dg@FreeBSD.org>

New flag: B_PAGING. Added as part of the vn driver hack.


# f6b04d2b 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 3aa12267 28-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) that I didn't notice when I fixed
"all" such warnings before.


# f57459b6 26-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed third arg (vmio) to allocbuf() that was added with the original
merged cache changes, and figure it out based on the B_VMIO buffer flag.
Fixes a problem where delayed write VMIO buffers would sometimes get
recopied into kernel-alloced memory.

Submitted by: John Dyson


# 76365205 25-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed an old VMIO #ifdef and made the type of b_pages 'struct vm_page *'.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 9d2ccc8e 18-Feb-1995 Bruce Evans <bde@FreeBSD.org>

New field b_biodone_chain to support nested b_iodone's.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 08d7d166 18-Oct-1994 David Greenman <dg@FreeBSD.org>

Removed references to bclnlist which we don't use/support/need.


# 3d05297c 09-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics. (sort of) Added 19 prototypes.


# 8e58bf68 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# bb56ec4a 25-Sep-1994 Poul-Henning Kamp <phk@FreeBSD.org>

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 16f62314 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Incorporated post 1.1.5 work from John Dyson. This includes performance
improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/
pmap_kremove. These routine allow fast mapping of pages for those
architectures that have "normal" MMUs. Also included is a fix to the
pageout daemon to properly check a queue end condition.

Submitted by: John Dyson


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 594110fe 26-May-1994 David Greenman <dg@FreeBSD.org>

Moved header definitions to buf.h, and added missing splx() - found
by Johannes Helander.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources