History log of /freebsd-11-stable/sys/geom/geom.h
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 347378 09-May-2019 kevans

MFC r346602, r346670-r346671, r347183: tun/tap race fixes

r346602:
tun(4): Defer clearing TUN_OPEN until much later

tun destruction will not continue until TUN_OPEN is cleared. There are brief
moments in tunclose where the mutex is dropped and we've already cleared
TUN_OPEN, so tun_destroy would be able to proceed while we're in the middle
of cleaning up the tun still. tun_destroy should be blocked until these
parts (address/route purges, mostly) are complete.

r346670:
tun/tap: close race between destroy/ioctl handler

It seems that there should be a better way to handle this, but this seems to
be the more common approach and it should likely get replaced in all of the
places it happens... Basically, thread 1 is in the process of destroying the
tun/tap while thread 2 is executing one of the ioctls that requires the
tun/tap mutex and the mutex is destroyed before the ioctl handler can
acquire it.

This is only one of the races described/found in PR 233955.

r346671:
tun(4): Don't allow open of open or dying devices

Previously, a pid check was used to prevent open of the tun(4); this works,
but may not make the most sense as we don't prevent the owner process from
opening the tun device multiple times.

The potential race described near tun_pid should not be an issue: if a
tun(4) is to be handed off, its fd has to have been sent via control message
or some other mechanism that duplicates the fd to the receiving process so
that it may set the pid. Otherwise, the pid gets cleared when the original
process closes it and you have no effective handoff mechanism.

Close up another potential issue with handing a tun(4) off by not clobbering
state if the closer isn't the controller anymore. If we want some state to
be cleared, we should do that a little more surgically.

Additionally, nothing prevents a dying tun(4) from being "reopened" in the
middle of tun_destroy as soon as the mutex is unlocked, quickly leading to a
bad time. Return EBUSY if we're marked for destruction, as well, and the
consumer will need to deal with it. The associated character device will be
destroyed in short order.

r347183:
geom: fix initialization order

There's a race between the initialization of devsoftc.mtx (by devinit)
and the creation of the geom worker thread g_run_events, which calls
devctl_queue_data_f. Both of those are initialized at SI_SUB_DRIVERS
and SI_ORDER_FIRST, which means the geom worked thread can be created
before the mutex has been initialized, leading to the panic below:

wpanic: mtx_lock() of spin mutex (null) @ /usr/home/osstest/build.135317.build-amd64-freebsd/freebsd/sys/kern/subr_bus.c:620
cpuid = 3
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe003b968710
vpanic() at vpanic+0x19d/frame 0xfffffe003b968760
panic() at panic+0x43/frame 0xfffffe003b9687c0
__mtx_lock_flags() at __mtx_lock_flags+0x145/frame 0xfffffe003b968810
devctl_queue_data_f() at devctl_queue_data_f+0x6a/frame 0xfffffe003b968840
g_dev_taste() at g_dev_taste+0x463/frame 0xfffffe003b968a00
g_load_class() at g_load_class+0x1bc/frame 0xfffffe003b968a30
g_run_events() at g_run_events+0x197/frame 0xfffffe003b968a70
fork_exit() at fork_exit+0x84/frame 0xfffffe003b968ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003b968ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 13 tid 100029 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

Fix this by initializing geom at SI_ORDER_SECOND instead of
SI_ORDER_FIRST.

PR: 233955


# 332095 06-Apr-2018 avg

MFC r330977: g_access: deal with races created by geoms that drop the topology lock

PR: 225960


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
# 300287 20-May-2016 kib

Remove asserts that Giant is not held on entrance into geom KPI, which
outlived their usefulness. This allows to remove drop/pickup Giant
wrappers around GEOM calls.

Discussed with: alfred, imp, phk
Sponsored by: The FreeBSD Foundation


# 300207 19-May-2016 ken

Add support for managing Shingled Magnetic Recording (SMR) drives.

This change includes support for SCSI SMR drives (which conform to the
Zoned Block Commands or ZBC spec) and ATA SMR drives (which conform to
the Zoned ATA Command Set or ZAC spec) behind SAS expanders.

This includes full management support through the GEOM BIO interface, and
through a new userland utility, zonectl(8), and through camcontrol(8).

This is now ready for filesystems to use to detect and manage zoned drives.
(There is no work in progress that I know of to use this for ZFS or UFS, if
anyone is interested, let me know and I may have some suggestions.)

Also, improve ATA command passthrough and dispatch support, both via ATA
and ATA passthrough over SCSI.

Also, add support to camcontrol(8) for the ATA Extended Power Conditions
feature set. You can now manage ATA device power states, and set various
idle time thresholds for a drive to enter lower power states.

Note that this change cannot be MFCed in full, because it depends on
changes to the struct bio API that break compatilibity. In order to
avoid breaking the stable API, only changes that don't touch or depend on
the struct bio changes can be merged. For example, the camcontrol(8)
changes don't depend on the new bio API, but zonectl(8) and the probe
changes to the da(4) and ada(4) drivers do depend on it.

Also note that the SMR changes have not yet been tested with an actual
SCSI ZBC device, or a SCSI to ATA translation layer (SAT) that supports
ZBC to ZAC translation. I have not yet gotten a suitable drive or SAT
layer, so any testing help would be appreciated. These changes have been
tested with Seagate Host Aware SATA drives attached to both SAS and SATA
controllers. Also, I do not have any SATA Host Managed devices, and I
suspect that it may take additional (hopefully minor) changes to support
them.

Thanks to Seagate for supplying the test hardware and answering questions.

sbin/camcontrol/Makefile:
Add epc.c and zone.c.

sbin/camcontrol/camcontrol.8:
Document the zone and epc subcommands.

sbin/camcontrol/camcontrol.c:
Add the zone and epc subcommands.

Add auxiliary register support to build_ata_cmd(). Make sure to
set the CAM_ATAIO_NEEDRESULT, CAM_ATAIO_DMA, and CAM_ATAIO_FPDMA
flags as appropriate for ATA commands.

Add a new get_ata_status() function to parse ATA result from SCSI
sense descriptors (for ATA passthrough over SCSI) and ATA I/O
requests.

sbin/camcontrol/camcontrol.h:
Update the build_ata_cmd() prototype

Add get_ata_status(), zone(), and epc().

sbin/camcontrol/epc.c:
Support for ATA Extended Power Conditions features. This includes
support for all features documented in the ACS-4 Revision 12
specification from t13.org (dated February 18, 2016).

The EPC feature set allows putting a drive into a power power mode
immediately, or setting timeouts so that the drive will
automatically enter progressively lower power states after various
idle times.

sbin/camcontrol/fwdownload.c:
Update the firmware download code for the new build_ata_cmd()
arguments.

sbin/camcontrol/zone.c:
Implement support for Shingled Magnetic Recording (SMR) drives
via SCSI Zoned Block Commands (ZBC) and ATA Zoned Device ATA
Command Set (ZAC).

These specs were developed in concert, and are functionally
identical. The primary differences are due to SCSI and ATA
differences. (SCSI is big endian, ATA is little endian, for
example.)

This includes support for all commands defined in the ZBC and
ZAC specs.

sys/cam/ata/ata_all.c:
Decode a number of additional ATA command names in ata_op_string().

Add a new CCB building function, ata_read_log().

Add ata_zac_mgmt_in() and ata_zac_mgmt_out() CCB building
functions. These support both DMA and NCQ encapsulation.

sys/cam/ata/ata_all.h:
Add prototypes for ata_read_log(), ata_zac_mgmt_out(), and
ata_zac_mgmt_in().

sys/cam/ata/ata_da.c:
Revamp the ada(4) driver to support zoned devices.

Add four new probe states to gather information needed for zone
support.

Add a new adasetflags() function to avoid duplication of large
blocks of flag setting between the async handler and register
functions.

Add new sysctl variables that describe zone support and paramters.

Add support for the new BIO_ZONE bio, and all of its subcommands:
DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP,
DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS.

sys/cam/scsi/scsi_all.c:
Add command descriptions for the ZBC IN/OUT commands.

Add descriptions for ZBC Host Managed devices.

Add a new function, scsi_ata_pass() to do ATA passthrough over
SCSI. This will eventually replace scsi_ata_pass_16() -- it
can create the 12, 16, and 32-byte variants of the ATA
PASS-THROUGH command, and supports setting all of the
registers defined as of SAT-4, Revision 5 (March 11, 2016).

Change scsi_ata_identify() to use scsi_ata_pass() instead of
scsi_ata_pass_16().

Add a new scsi_ata_read_log() function to facilitate reading
ATA logs via SCSI.

sys/cam/scsi/scsi_all.h:
Add the new ATA PASS-THROUGH(32) command CDB. Add extended and
variable CDB opcodes.

Add Zoned Block Device Characteristics VPD page.

Add ATA Return SCSI sense descriptor.

Add prototypes for scsi_ata_read_log() and scsi_ata_pass().

sys/cam/scsi/scsi_da.c:
Revamp the da(4) driver to support zoned devices.

Add five new probe states, four of which are needed for ATA
devices.

Add five new sysctl variables that describe zone support and
parameters.

The da(4) driver supports SCSI ZBC devices, as well as ATA ZAC
devices when they are attached via a SCSI to ATA Translation (SAT)
layer. Since ZBC -> ZAC translation is a new feature in the T10
SAT-4 spec, most SATA drives will be supported via ATA commands
sent via the SCSI ATA PASS-THROUGH command. The da(4) driver will
prefer the ZBC interface, if it is available, for performance
reasons, but will use the ATA PASS-THROUGH interface to the ZAC
command set if the SAT layer doesn't support translation yet.
As I mentioned above, ZBC command support is untested.

Add support for the new BIO_ZONE bio, and all of its subcommands:
DISK_ZONE_OPEN, DISK_ZONE_CLOSE, DISK_ZONE_FINISH, DISK_ZONE_RWP,
DISK_ZONE_REPORT_ZONES, and DISK_ZONE_GET_PARAMS.

Add scsi_zbc_in() and scsi_zbc_out() CCB building functions.

Add scsi_ata_zac_mgmt_out() and scsi_ata_zac_mgmt_in() CCB/CDB
building functions. Note that these have return values, unlike
almost all other CCB building functions in CAM. The reason is
that they can fail, depending upon the particular combination
of input parameters. The primary failure case is if the user
wants NCQ, but fails to specify additional CDB storage. NCQ
requires using the 32-byte version of the SCSI ATA PASS-THROUGH
command, and the current CAM CDB size is 16 bytes.

sys/cam/scsi/scsi_da.h:
Add ZBC IN and ZBC OUT CDBs and opcodes.

Add SCSI Report Zones data structures.

Add scsi_zbc_in(), scsi_zbc_out(), scsi_ata_zac_mgmt_out(), and
scsi_ata_zac_mgmt_in() prototypes.

sys/dev/ahci/ahci.c:
Fix SEND / RECEIVE FPDMA QUEUED in the ahci(4) driver.

ahci_setup_fis() previously set the top bits of the sector count
register in the FIS to 0 for FPDMA commands. This is okay for
read and write, because the PRIO field is in the only thing in
those bits, and we don't implement that further up the stack.

But, for SEND and RECEIVE FPDMA QUEUED, the subcommand is in that
byte, so it needs to be transmitted to the drive.

In ahci_setup_fis(), always set the the top 8 bits of the
sector count register. We need it in both the standard
and NCQ / FPDMA cases.

sys/geom/eli/g_eli.c:
Pass BIO_ZONE commands through the GELI class.

sys/geom/geom.h:
Add g_io_zonecmd() prototype.

sys/geom/geom_dev.c:
Add new DIOCZONECMD ioctl, which allows sending zone commands to
disks.

sys/geom/geom_disk.c:
Add support for BIO_ZONE commands.

sys/geom/geom_disk.h:
Add a new flag, DISKFLAG_CANZONE, that indicates that a given
GEOM disk client can handle BIO_ZONE commands.

sys/geom/geom_io.c:
Add a new function, g_io_zonecmd(), that handles execution of
BIO_ZONE commands.

Add permissions check for BIO_ZONE commands.

Add command decoding for BIO_ZONE commands.

sys/geom/geom_subr.c:
Add DDB command decoding for BIO_ZONE commands.

sys/kern/subr_devstat.c:
Record statistics for REPORT ZONES commands. Note that the
number of bytes transferred for REPORT ZONES won't quite match
what is received from the harware. This is because we're
necessarily counting bytes coming from the da(4) / ada(4) drivers,
which are using the disk_zone.h interface to communicate up
the stack. The structure sizes it uses are slightly different
than the SCSI and ATA structure sizes.

sys/sys/ata.h:
Add many bit and structure definitions for ZAC, NCQ, and EPC
command support.

sys/sys/bio.h:
Convert the bio_cmd field to a straight enumeration. This will
yield more space for additional commands in the future. After
change r297955 and other related changes, this is now possible.
Converting to an enumeration will also prevent use as a bitmask
in the future.

sys/sys/disk.h:
Define the DIOCZONECMD ioctl.

sys/sys/disk_zone.h:
Add a new API for managing zoned disks. This is very close to
the SCSI ZBC and ATA ZAC standards, but uses integers in native
byte order instead of big endian (SCSI) or little endian (ATA)
byte arrays.

This is intended to offer to the complete feature set of the ZBC
and ZAC disk management without requiring the application developer
to include SCSI or ATA headers. We also use one set of headers
for ioctl consumers and kernel bio-level consumers.

sys/sys/param.h:
Bump __FreeBSD_version for sys/bio.h command changes, and inclusion
of SMR support.

usr.sbin/Makefile:
Add the zonectl utility.

usr.sbin/diskinfo/diskinfo.c
Add disk zoning capability to the 'diskinfo -v' output.

usr.sbin/zonectl/Makefile:
Add zonectl makefile.

usr.sbin/zonectl/zonectl.8
zonectl(8) man page.

usr.sbin/zonectl/zonectl.c
The zonectl(8) utility. This allows managing SCSI or ATA zoned
disks via the disk_zone.h API. You can report zones, reset write
pointers, get parameters, etc.

Sponsored by: Spectra Logic
Differential Revision: https://reviews.freebsd.org/D6147
Reviewed by: wblock (documentation)


# 295707 17-Feb-2016 imp

Create an API to reset a struct bio (g_reset_bio). This is mandatory
for all struct bio you get back from g_{new,alloc}_bio. Temporary
bios that you create on the stack or elsewhere should use this before
first use of the bio, and between uses of the bio. At the moment, it
is nothing more than a wrapper around bzero, but that may change in
the future. The wrapper also removes one place where we encode the
size of struct bio in the KBI.


# 256956 23-Oct-2013 smh

Improve ZFS N-way mirror read performance by using load and locality
information.

The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.

The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay


# 256880 22-Oct-2013 mav

Merge GEOM direct dispatch changes from the projects/camlock branch.

When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.

The defined now safety requirements are:
- caller should not hold any locks and should be reenterable;
- callee should not depend on GEOM dual-threaded concurency semantics;
- on the way down, if request is unmapped while callee doesn't support it,
the context should be sleepable;
- kernel thread stack usage should be below 50%.

To keep compatibility with GEOM classes not meeting above requirements
new provider and consumer flags added:
- G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
- G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
- G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
- G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
Capable GEOM class can set them, allowing direct dispatch in cases where
it is safe. If any of requirements are not met, request is queued to
g_up or g_down thread same as before.

Such GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).

To declare direct completion capability disk(9) KPI got new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. da(4) and ada(4) disk
drivers got it set now thanks to earlier CAM locking work.

This change more then twice increases peak block storage performance on
systems with manu CPUs, together with earlier CAM locking changes reaching
more then 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).

Sponsored by: iXsystems, Inc.
MFC after: 2 months


# 248508 19-Mar-2013 kib

Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks


# 243333 20-Nov-2012 jh

- Don't pass geom and provider names as format strings.
- Add __printflike() attributes.
- Remove an extra argument for the g_new_geomf() call in swapongeom_ev().

Reviewed by: pjd


# 239790 28-Aug-2012 ed

Remove unneeded G_PF_CANDELETE flag.

This flag is only used by GEOM so it can be propagated to the character
device's SI_CANDELETE. Unfortunately, SI_CANDELETE seems to do nothing.


# 238886 29-Jul-2012 mav

Implement media change notification for DA and CD removable media devices.
It includes three parts:
1) Modifications to CAM to detect media media changes and report them to
disk(9) layer. For modern SATA (and potentially UAS) devices it utilizes
Asynchronous Notification mechanism to receive events from hardware.
Active polling with TEST UNIT READY commands with 3 seconds period is used
for incapable hardware. After that both CD and DA drivers work the same way,
detecting two conditions: "NOT READY: Medium not present" after medium was
detected previously, and "UNIT ATTENTION: Not ready to ready change, medium
may have changed". First one reported to disk(9) as media removal, second
as media insert/change. To reliably receive second event new
AC_UNIT_ATTENTION async added to make UAs broadcasted to all periphs by
generic error handling code in cam_periph_error().
2) Modifications to GEOM core to handle media remove and change events.
Media removal handled by spoiling all consumers attached to the provider.
Media change event also schedules provider retaste after spoiling to probe
new media. New flag G_CF_ORPHAN was added to consumers to reflect that
consumer is in process of destruction. It allows retaste to create new
geom instance of the same class, while previous one is still dying.
3) Modifications to some GEOM classes: DEV -- to report media change
events to devd; VFS -- to handle spoiling same as orphan to prevent
accessing replaced media. PART class already handles spoiling alike to
orphan.

Reviewed by: silence on geom@ and scsi@
Tested by: avg
Sponsored by: iXsystems, Inc. / PC-BSD
MFC after: 2 months


# 238559 17-Jul-2012 ken

Add back spare fields consumed in r237545. It seems that these should only
be consumed to maintain backward compatibility in stable, but should not be
consumed in head.

Submitted by: trasz, attilio (indirectly)


# 238533 16-Jul-2012 trasz

Add back spare fields reused in r238213. According to Attilio, the rule
is to use reuse spares only when MFC-ing, not in CURRENT.


# 238213 07-Jul-2012 trasz

Add a new GEOM method, resize(), which is called after provider size changes.
Add a new routine, g_resize_provider(), to use to notify GEOM about provider
change.

Reviewed by: mav
Sponsored by: FreeBSD Foundation


# 237545 25-Jun-2012 ken

Consume spare fields for the providergone pointers added to the g_class and
g_geom structures in change 237518. The original change would have broken
the ABI.

Suggested by: ae
MFC after: 4 days


# 237518 24-Jun-2012 ken

Fix a bug which causes a panic in daopen(). The panic is caused by
a da(4) instance going away while GEOM is still probing it.

In this case, the GEOM disk class instance has been created by
disk_create(), and the taste of the disk is queued in the GEOM
event queue.

While that event is queued, the da(4) instance goes away. When the
open call comes into the da(4) driver, it dereferences the freed
(but non-NULL) peripheral pointer provided by GEOM, which results
in a panic.

The solution is to add a callback to the GEOM disk code that is
called when all of its resources are cleaned up. This is
implemented inside GEOM by adding an optional callback that is
called when all consumers have detached from a provider, and the
provider is about to be deleted.

scsi_cd.c,
scsi_da.c: In the register routine for the cd(4) and da(4)
routines, acquire a reference to the CAM peripheral
instance just before we call disk_create().

Use the new GEOM disk d_gone() callback to register
a callback (dadiskgonecb()/cddiskgonecb()) that
decrements the peripheral reference count once GEOM
has finished cleaning up its resources.

In the cd(4) driver, clean up open and close
behavior slightly. GEOM makes sure we only get one
open() and one close call, so there is no need to
set an open flag and decrement the reference count
if we are not the first open.

In the cd(4) driver, use cam_periph_release_locked()
in a couple of error scenarios to avoid extra mutex
calls.

geom.h: Add a new, optional, providergone callback that
is called when a provider is about to be deleted.

geom_disk.h: Add a new d_gone() callback to the GEOM disk
interface.

Bump the DISK_VERSION to version 2. This probably
should have been done after a couple of previous
changes, especially the addition of the d_getattr()
callback.

geom_disk.c: Add a providergone callback for the disk class,
g_disk_providergone(), that calls the user's
d_gone() callback if it exists.

Bump the DISK_VERSION to 2.

geom_subr.c: In g_destroy_provider(), call the providergone
callback if it has been provided.

In g_new_geomf(), propagate the class's
providergone callback to the new geom instance.

blkfront.c: Callers of disk_create() are supposed to pass in
DISK_VERSION, not an explicit disk API version
number. Update the blkfront driver to do that.

disk.9: Update the disk(9) man page to include information
on the new d_gone() callback, as well as the
previously added d_getattr() callback, d_descr
field, and HBA PCI ID fields.

MFC after: 5 days


# 224147 17-Jul-2011 pjd

Add some spare fields to the g_class and g_geom structures needed to implement
direct I/O handling and provider's property changes handling.


# 223930 11-Jul-2011 ae

Remove include of sys/sbuf.h from geom/geom.h.
sbuf support is not always required for geom/geom.h users, and no need to
depend from it.

PR: kern/158398


# 223089 14-Jun-2011 gibbs

Plumb device physical path reporting from CAM devices, through GEOM and
DEVFS, and make it accessible via the diskinfo utility.

Extend GEOM's generic attribute query mechanism into generic disk consumers.
sys/geom/geom_disk.c:
sys/geom/geom_disk.h:
sys/cam/scsi/scsi_da.c:
sys/cam/ata/ata_da.c:
- Allow disk providers to implement a new method which can override
the default BIO_GETATTR response, d_getattr(struct bio *). This
function returns -1 if not handled, otherwise it returns 0 or an
errno to be passed to g_io_deliver().

sys/cam/scsi/scsi_da.c:
sys/cam/ata/ata_da.c:
- Don't copy the serial number to dp->d_ident anymore, as the CAM XPT
is now responsible for returning this information via
d_getattr()->(a)dagetattr()->xpt_getatr().

sys/geom/geom_dev.c:
- Implement a new ioctl, DIOCGPHYSPATH, which returns the GEOM
attribute "GEOM::physpath", if possible. If the attribute request
returns a zero-length string, ENOENT is returned.

usr.sbin/diskinfo/diskinfo.c:
- If the DIOCGPHYSPATH ioctl is successful, report physical path
data when diskinfo is executed with the '-v' option.

Submitted by: will
Reviewed by: gibbs
Sponsored by: Spectra Logic Corporation

Add generic attribute change notification support to GEOM.

sys/sys/geom/geom.h:
Add a new attrchanged method field to both g_class
and g_geom.

sys/sys/geom/geom.h:
sys/geom/geom_event.c:
- Provide the g_attr_changed() function that providers
can use to advertise attribute changes.
- Perform delivery of attribute change notifications
from a thread context via the standard GEOM event
mechanism.

sys/geom/geom_subr.c:
Inherit the attrchanged method from class to geom (class instance).

sys/geom/geom_disk.c:
Provide disk_attr_changed() to provide g_attr_changed() access
to consumers of the disk API.

sys/cam/scsi/scsi_pass.c:
sys/cam/scsi/scsi_da.c:
sys/geom/geom_dev.c:
sys/geom/geom_disk.c:
Use attribute changed events to track updates to physical path
information.

sys/cam/scsi/scsi_da.c:
Add AC_ADVINFO_CHANGED to the registered asynchronous CAM
events for this driver. When this event occurs, and
the updated buffer type references our physical path
attribute, emit a GEOM attribute changed event via the
disk_attr_changed() API.

sys/cam/scsi/scsi_pass.c:
Add AC_ADVINFO_CHANGED to the registered asynchronous CAM
events for this driver. When this event occurs, update
the physical patch devfs alias for this pass instance.

Submitted by: gibbs
Sponsored by: Spectra Logic Corporation


# 221101 26-Apr-2011 mav

Implement relaxed comparision for hardcoded provider names to make it
ignore adX/adaY difference in both directions to simplify migration to
the CAM-based ATA or back.


# 219970 24-Mar-2011 mav

MFgraid/head r218212, r218257:
Introduce new type of BIO_GETATTR -- GEOM::setstate, used to inform lower
GEOM about state of it's providers from the point of upper layers.
Make geom_disk use led(4) subsystem to illuminate states in such fashion:
FAILED - "1" (on), REBUILD - "f5" (slow blink), RESYNC - "f1" (fast blink),
ACTIVE - "0" (off).
LED name should be set for each disk via kern.geom.disk.%s.led sysctl.
Later disk API could be extended to allow disk driver to report this info
in custom way via it's own facilities.


# 219950 24-Mar-2011 mav

MFgraid/head r217827:
Change BIO_GETATTR("GEOM::kerneldump") API to make set_dumper() called by
consumer (geom_dev) instead of provider (geom_disk). This allows any geom
insert it's code into the dump call chain, implementing more sophisticated
functionality then just disk partitioning.


# 207671 05-May-2010 jh

Fix deadlock between GEOM class unloading and withering. Withering can't
proceed while g_unload_class() blocks the event thread. Fix this by not
running g_unload_class() as a GEOM event and dropping the topology lock
when withering needs to proceed.

PR: kern/139847
Silence on: freebsd-geom


# 195195 30-Jun-2009 trasz

Make gjournal work with kernel compiled with "options DIAGNOSTIC".
Previously, it would panic immediately.

Reviewed by: pjd
Approved by: re (kib)


# 193981 11-Jun-2009 luigi

As discussed in the devsummit, introduce two fields in the
struct bio to store classification information, and a hook
for classifier functions that can be called by g_io_request().

This code is from Fabio Checconi as part of his GSOC work.


# 190878 10-Apr-2009 thompsa

Revert r190676,190677

The geom and CAM changes for root_hold are the wrong solution for USB design
quirks.

Requested by: scottl


# 190677 03-Apr-2009 thompsa

Add interleaving root hold tokens from the CAM probe to disk_create and geom
provider tasting. This is needed for disk attachments that happen after threads
are running in the boot process.

Tested by: rnoland


# 187973 31-Jan-2009 marcel

Constify val in g_handleattr() and str in g_handleattr_str().
This allows passing string constants to g_handleattr_str().


# 177509 22-Mar-2008 marcel

Add g_retaste(), which given a class will present all non-open providers
to it for tasting. This is useful when the class, through means outside
the scope of GEOM, can claim providers previously unclaimed.

The g_retaste() function posts an event which is handled by the
g_retaste_event().

Event suggested by: phk


# 169283 05-May-2007 pjd

Implement g_delete_data() similar to g_read_data() and g_write_data().

OK'ed by: phk


# 169282 05-May-2007 pjd

- Implement helper g_handleattr_str() function for string attributes
handling.
- Extend g_handleattr() to treat attribute as string when len=0.

OK'ed by: phk


# 163832 31-Oct-2006 pjd

Add a new I/O request - BIO_FLUSH, which basically tells providers below to
flush their caches. For now will mostly be used by disks to flush their
write cache.

Sponsored by: home.pl


# 162352 16-Sep-2006 pjd

Add __printflike() to gctl_error().

Approved by: phk
MFC after: 1 week


# 162326 15-Sep-2006 pjd

Add 'show geom [addr]' ddb(4) command, which prints entire GEOM topology if
no additional argument is given or details about the given GEOM object
(class, geom, provider or consumer).

Approved by: phk


# 160301 12-Jul-2006 pjd

Only check if we're freeing a valid object if we hold the topology lock.
This prevents panic under heavy load with DIAGNOSTIC compiled in.


# 159304 05-Jun-2006 pjd

Add g_duplicate_bio() function which does the same thing what g_clone_bio()
is doing, but g_duplicate_bio() allocates new bio with M_WAITOK flag.


# 157619 10-Apr-2006 marcel

Add g_wither_provider() to abstract the details of destroying a
particular provider. Use this function where g_orphan_provider()
is being called so that the flags are updated correctly and
g_orphan_provider() is called only when allowed.


# 157581 07-Apr-2006 marcel

Change gctl_set_param() to return an error instead of setting an
error on the request. Add a wrapper, gctl_set_param_err(), that
sets the error on the request from the error returned by
gctl_set_param() and update current callers of gctl_set_param()
to call gctl_set_param_err() instead.
This makes gctl_set_param() much more usable in situations where
the caller knows better what to do with certain (apparent) error
conditions and setting an error on the request is not one of the
things that need to be done.


# 149757 03-Sep-2005 phk

Typo.


# 139139 21-Dec-2004 pjd

Implement g_topology_try_lock().

No objection from: phk


# 138732 12-Dec-2004 phk

Pass the file->flags down to geom ioctl handlers.

Reject certain ioctls if write permission is not indicated.

Bump geom API version.

Reported by: Ruben de Groot <mail25@bzerk.org>


# 137489 09-Nov-2004 pjd

Introduce g_waitidlelock() function which is simlar to g_waitidle(),
but should be called with the topology lock held and returns with the
topology lock held and empty event queue.

Approved by: phk (sometime ago)


# 137032 29-Oct-2004 phk

Add g_wither_geom_close() function.


# 136836 23-Oct-2004 phk

Move the prototype for g_waitidle() to a more visible place.


# 134379 27-Aug-2004 phk

Introduce g_alloc_bio() as a waiting variant of g_new_bio().

Use in places where we can sleep and where we previously failed to check
for a NULL pointer.

MT5 candidate.


# 133312 08-Aug-2004 phk

Give classes a version number and refuse to touch classes which are not
understood. This makes room for additional binary compatibility in the
future.

Put fields in the class for the geom's methods and initialize the methods
of a new geom from these fields. This saves some code in all classes.


# 130875 21-Jun-2004 phk

Kill g_access_rel() already now before we send it down 5-stable


# 130585 16-Jun-2004 phk

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 126798 10-Mar-2004 phk

Rearrange some of the GEOM debugging tools to be more structured.

Retire g_sanity() and corresponding debugflag (0x8)

Retire g_{stall,release}_events().

Under #ifdef DIAGNOSTIC:

Make g_valid_obj() an official function and have it return an an
non-zero integer which indicates the kind of object when found.

Implement G_VALID_{CLASS,GEOM,CONSUMER,PROVIDER}() macros based
on g_valid_obj().

Sprinkle calls to these macros liberally over the infrastructure.

Always check that we do not free a live object.


# 125755 12-Feb-2004 phk

Remove the absolute count g_access_abs() function since experience has
shown that it is not useful.

Rename the relative count g_access_rel() function to g_access(), only
the name has changed.

Change all g_access_rel() calls in our CVS tree to call g_access() instead.

Add an #ifndef BURN_BRIDGES #define of g_access_rel() for source
code compatibility.


# 125743 12-Feb-2004 phk

Give both consumers and providers a {void *private, u_int index} which
the implementing class can use to hang internal info from.


# 125713 11-Feb-2004 pjd

Added g_print_bio() function to print informations about given bio.

Approved by: phk, scottl (mentor)


# 125656 10-Feb-2004 pjd

Added macro which will be used to assert, that the topology lock is not held.

Approved by: phk, scottl (mentor)


# 123233 07-Dec-2003 phk

KASSERT against multiple orphanings of providers.


# 123215 07-Dec-2003 scottl

Re-arrange and consolidate some random debugging stuff


# 120851 06-Oct-2003 phk

Introduce a per provider wither flag


# 119660 01-Sep-2003 phk

Simplify the ioctl handling in GEOM.

This replaces the current ioctl processing with a direct call path
from geom_dev() where the ioctl arrives (from SPECFS) to any directly
connected GEOM class.

The inverse of the above is no longer supported. This is the
situation were you have one or more intervening GEOM classes, for
instance a BSDlabel on top of a MBR or PC98. If you want to issue
MBR or PC98 specific ioctls, you will need to issue them on a MBR
or PC98 providers.

This paves the way for inviting CD's, FD's and other special cases
inside GEOM.


# 119593 30-Aug-2003 phk

Add the new g_dev_getprovider() function, the swap_pager needs it now.

Spotted by: mr


# 115960 07-Jun-2003 phk

Improve the root-dev prompt facility for printing devices which could
possibly be a root filesystem.


# 115951 07-Jun-2003 phk

Drop a memory-corruption debugging test-tool.


# 115850 04-Jun-2003 phk

Introduce g_provider_by_name() function, and use it.


# 115624 01-Jun-2003 phk

Simplify the GEOM OAM api: Drop the request type, and let everything
hinge on the "verb" parameter which the class gets to interpret as
it sees fit.

Move the entire request into the kernel and move changed parameters
back when done.


# 115623 01-Jun-2003 phk

constify g_sanity()


# 115473 31-May-2003 phk

Introduce a init and fini member functions on a class.

Use ->init() and ->fini() to handle the mutex in geom_disk.c

Remove the g_add_class() function and replace it with a standardized
g_modevent() function.

This adds the basic infrastructure for loading/unloading GEOM classes


# 115468 31-May-2003 phk

Remove the G_CLASS_INITIALIZER, we do not need it anymore.


# 114670 04-May-2003 phk

Add gctl_set_param() function.


# 114495 02-May-2003 phk

Rework the "withering" mechanism:

Introduce g_wither_geom() to do the work in one single place.


# 114455 01-May-2003 phk

Remove the now obsolete geomidorname hack.


# 113940 23-Apr-2003 phk

Introduce a g_waitfor_event() function which posts an event and waits for
it to be run (or cancelled) and use this instead of home-rolled versions.


# 113937 23-Apr-2003 phk

Rename g_call_me() to g_post_event(), and give it a flag
argument to determine if we can M_WAITOK in malloc.


# 113892 23-Apr-2003 phk

Make gctl_error() take printfline varargs.


# 113889 23-Apr-2003 phk

Remove unused event pointers in object structures.
Remove KASSERTS which checked that they were unused.


# 113876 22-Apr-2003 phk

Implement handling of CONFIG_GEOM OAM request.


# 113432 13-Apr-2003 phk

Time has run from the "run GEOM in userland" harness, and the new regression
test is built to test GEOM as running in the kernel.

This commit is basically "unifdef -D_KERNEL" to remove the mainly #include
related code to support the userland-harness.


# 113032 03-Apr-2003 phk

Remove all references to BIO_SETATTR. We will not be using it.


# 113012 03-Apr-2003 phk

Remove geom_enc.c, a superset of these functions are now available in
<sys/endian.h>


# 112989 02-Apr-2003 phk

Add handling for cancelled events in the g_call_me() methods.


# 112988 02-Apr-2003 phk

Change events to have an array of "void *" references, and give the
event posting functions varargs to fill these.

Attribute g_call_me() to appropriate g_geom's where necessary.

Add a flag argument to g_call_me() methods which will be used to signal
cancellation of events in the future.

This commit should be a no-op.


# 112927 01-Apr-2003 phk

Remove the old config interface, the new OAM is sufficiently functional
now.


# 112709 27-Mar-2003 phk

Run a revision on the OAM api.

Use prefix gctl_ systematically.
Add flag with access perms for each argument.
Add ro/rw versions of argument building functions.
General cleanup.


# 112595 25-Mar-2003 phk

Remove unuse g_insert_geom().


# 112552 24-Mar-2003 phk

Premptively change initializations of struct g_class to use C99
sparse struct initializations before we extend the struct with
new OAM related member functions.


# 112370 18-Mar-2003 phk

Retire the GEOM private statistics code and use devstat instead.


# 112026 09-Mar-2003 phk

Add u_int nstart, nend counters to consumer and providers so we will not
have to examine the stats structure to tell if we have outstanding I/O
requests.

Making them u_int improves the chance of atomic updates to them,
but risks roll-over. Since the only interesting property is if
they are equal or not, this is not an issue.


# 110710 11-Feb-2003 phk

Better names for struct disk elements: d_maxsize, d_stripeoffset
and d_stripesisze;

Introduce si_stripesize and si_stripeoffset in struct cdev so we
can make the visible to clustering code.

Add stripesize and stripeoffset to providers.

DTRT with stripesize and stripeoffset in various places in GEOM.


# 110690 11-Feb-2003 phk

Introduce flag field and G_PF_CANDELETE field on providers.


# 110541 08-Feb-2003 phk

Move the g_stat struct to its own .h file, we will export it to other code.

Insted of embedding a struct g_stat in consumers and providers, merely
include a pointer.

Remove a couple of <sys/time.h> includes now unneeded.

Add a special allocator for struct g_stat. This allocator will allocate
entire pages and hand out g_stat functions from there. The "id" field
indicates free/used status.

Add "/dev/geom.stats" device driver whic exports the pages from the
allocator to userland with mmap(2) in read-only mode.

This mmap(2) interface should be considered a non-public interface and
the functions in libgeom (not yet committed) should be used to access
the statistics data.


# 110523 07-Feb-2003 phk

Commit the correct copy of the g_stat structure.

Add debug.sizeof.g_stat sysctl.

Set the id field of the g_stat when we create consumers and providers.

Remove biocount from consumer, we will use the counters in the g_stat
structure instead. Replace one field which will need to be atomically
manipulated with two fields which will not (stat.nop and stat.nend).

Change add companion field to bio_children: bio_inbed for the exact
same reason.

Don't output the biocount in the confdot output.

Fix KASSERT in g_io_request().

Add sysctl kern.geom.collectstats defaulting to off.

Collect the following raw statistics conditioned on this sysctl:

for each consumer and provider {
total number of operations started.
total number of operations completed.
time last operation completed.
sum of idle-time.
for each of BIO_READ, BIO_WRITE and BIO_DELETE {
number of operations completed.
number of bytes completed.
number of ENOMEM errors.
number of other errors.
sum of transaction time.
}
}

API for getting hold of these statistics data not included yet.


# 110518 07-Feb-2003 phk

Add the new statistics structure, put one in consumers and providers.
include <sys/time.h> as necessary.


# 109972 28-Jan-2003 phk

Use a void * to carry the private data for return-call'ed ioctl requests.
Amongst other things this avoids a complex workaround in the userland
regression bits.


# 109170 13-Jan-2003 phk

Remove g_silence(). It does not do anything anymore.


# 107953 16-Dec-2002 phk

Constification and some s/int/u_int/ changes.


# 107453 01-Dec-2002 phk

Use more mnemonic argument names in the access functions.

Sponsored by: DARPA & NAI Labs
Approved by: re (blanket)


# 106518 06-Nov-2002 phk

Straighten up the geom.ctl config interface definitions.

Sponsored by: DARPA & NAI Labs


# 105947 25-Oct-2002 phk

Add a g_dev_print() function which prints all the /dev entries GEOM
know about.


# 105542 20-Oct-2002 phk

Make the sectorsize a property of providers so we can include it in the XML
output.

Sponsored by: DARPA & NAI Labs


# 105504 20-Oct-2002 phk

Make it possible to specify also via geom_t ID in the geom.ctl config ioctl.

Sponsored by: DARPA & NAI Labs.


# 105163 15-Oct-2002 phk

Constification ? Yes, out that door, row on the left, one patch each.

Sponsored by: DARPA & NAI Labs


# 105124 14-Oct-2002 jake

Moved geom class initialization to SI_SUB_DRIVERS from SI_SUB_PSEUDO.
This fixes mounting root from md(4) which calls disk_create() early.


# 105092 14-Oct-2002 phk

Implement the GEOMCONFIGGEOM ioctl which can be used to manually create
and configure an instance of a class on a give provider.

Sponsored by: DARPA & NAI Labs


# 105068 13-Oct-2002 phk

Add the outline of the "/dev/geom.ctl" handling code.

Sponsored by: DARPA & NAI Labs.


# 105061 13-Oct-2002 phk

Give GEOM modules a chance to specify their own init routine, in case they
have special requirements.

Sponsored by: DARPA & NAI Labs.


# 104602 07-Oct-2002 phk

Copyin and copyout are only possible from a process-native thread,
and therefore we need a way for ioctl handlers to run in that thread
in GEOM. Rather than invent a complicated registration system to
recognize which ioctl handler to use for a given ioctl, we still
schedule all ioctls down the tree as bio transactions but add a
special return code that means "call me directly" and have the
geom_dev layer do that.

Use this for all ioctls that make it as far as a diskdriver to
avoid any backwards compatibility problems.

Requested by: scottl
Sponsored by: DARPA & NAI Labs


# 104195 30-Sep-2002 phk

Retire g_io_fail() and let g_io_deliver() take an error argument instead.

Sponsored by: DARPA & NAI Labs.


# 104194 30-Sep-2002 phk

Introduce g_write_data() function.

Sponsored by: DARPA & NAI Labs


# 104193 30-Sep-2002 phk

Add missing g_enc_le2().

Sponsored by: DARPA & NAI Labs.


# 104060 27-Sep-2002 phk

Various no-ops:

Add a __unused.

Make the 2byte decoder functions return 16 bits for the benefits
of picky lints.

No need to grab giant around a tsleep() when we have a timeout.

Sponsored by: DARPA & NAI Labs.


# 104056 27-Sep-2002 phk

Implement g_call_me() as a way for geom methods to schedule operations
to be performed in the event-thread.

To do this, we need to lock the eventlist with g_eventlock (nee g_doorlock),
since g_call_me() being called from the UP/DOWN paths will not be able to
aquire g_topology_lock.

This also means that for now these events are not referenced on any
particular consumer/provider/geom.

For UP/DOWN path use, this will not become a problem since the access()
function will make sure we drain any bio's before we dismantle.

Sponsored by: DARPA & NAI Labs.


# 103279 13-Sep-2002 phk

Add a couple more of the big/little-endian conversion routines and make
them visible from userland, if need be.

I wish that the C language contained this as part of struct definintions,
but failing that, I would settle for an agreed upon set of functions for
packing/unpacking integers in various sizes from byte-streams which may
have unfriendly alignment.

This really belongs in <sys/endian.h> I guess.


# 103009 06-Sep-2002 phk

Remove "magicspace". It looks good on paper, it doesn't work in practice.

Sponsored by: DARPA & NAI Labs.


# 98066 09-Jun-2002 phk

Improve some on the naming.

Submitted by: iedowse


# 97887 05-Jun-2002 phk

Change the registration of magic spaces so it does its own memory management.

Sponsored by: DARPA & NAI Labs.


# 97078 21-May-2002 phk

Introduce the concept of "magic spaces", and implement them in most of
the relevant classes.

Some methods may implement various "magic spaces", this is reserved
or magic areas on the disk, set a side for various and sundry purposes.
A good example is the BSD disklabel and boot code on i386 which occupies
a total of four magic spaces: boot1, the disklabel, the padding behind
the disklabel and boot2. The reason we don't simply tell people to
write the appropriate stuff on the underlying device is that (some of)
the magic spaces might be real-time modifiable. It is for instance
possible to change a disklabel while partitions are open, provided
the open partitions do not get trampled in the process.

Sponsored by: DARPA & NAI Labs.


# 96987 20-May-2002 phk

Don't grab Giant around malloc(9) and free(9).
Don't grab Giant around wakeup(9).
Don't print verbose messages about each device found in geom_dev.
Various cleanups.

Sponsored by: DARPA & NAI Labs.


# 95362 24-Apr-2002 phk

Make specific provisions for the kernel simulator used in the regression
tests, other userland programs may need to include <geom/geom.h>.

Sponsored by: DARPA & NAI Labs.


# 95323 23-Apr-2002 phk

Implement the GEOMGETCONF ioctl which returns vital stats for the
current device in XML in an sbuf.

Sponsored by: DARPA & NAI Labs


# 95310 23-Apr-2002 phk

Introduce some serious paranoia to try to catch a memory overwrite problem
as early as possible.

Sponsored by: DARPA & NAI Labs


# 95276 22-Apr-2002 phk

Protect against multitple #includes of this file.


# 95038 19-Apr-2002 phk

Make kernel dumps work with GEOM.

Notice that if the device on which the dump is set is destroyed for
any reason, the dump setting is lost. This in particular will
happen in the case of spoilage. For instance if you set dump on
ad0s1b and open ad0 for writing, ad0s* will be spoilt and the dump
setting lost. See geom(4) for more about spoiling.

Sponsored by: DARPA & NAI Labs.


# 94284 09-Apr-2002 phk

Introduce the convenience function g_getattr() and make it DWIM.

Sponsored by: DARPA & NAI Labs.


# 94283 09-Apr-2002 phk

Constifixation of attribute argument to g_io_[gs]etattr()

Sponsored by: DARPA & NAI Labs


# 93778 04-Apr-2002 phk

Centralize EOF handling and improve access controls for bio scheduling.

Sponsored by: DARPA & NAI Labs


# 93776 04-Apr-2002 phk

Move access and orphan member functions from class to geom.

Sponsored by: DARPA & NAI Labs


# 93250 26-Mar-2002 phk

Eliminate some thread pointers which do not make sense anymore.

Split private parts of geom.h into geom_int.h. The latter should
never be included in class implemtations.


# 93248 26-Mar-2002 phk

Cave in to tradition and rename "methods" to "classes".


# 93090 24-Mar-2002 phk

Be more systematic about conversion of on-disk formats in a endian/width
agnostic way.

Collapse the MBR and MBREXT methods into one file and make them endian/width
agnostic.

Sponsored by: DARPA & NAI Labs.


# 92514 17-Mar-2002 phk

Need a different #include for the userland regression test.


# 92403 16-Mar-2002 phk

Add a generic and general ioctl pass-through mechanism.

It should now be posible to issue ioctls to SCSI CD drives.


# 92108 11-Mar-2002 phk

First commit of the GEOM subsystem to make it easier for people to
test and play with this.

This is not yet production quality and should be run only on dedicated
test boxes.

For people who want to develop transformations for GEOM there exist a
set of shims to run geom in userland (ask phk@freebsd.org).

Reports of all kinds to: phk@freebsd.org
Please include in report:
dmesg
sysctl debug.geomdot
sysctl debug.geomconf

Known significant limitations:
no kernel dump facility.
ioctls severely restricted.

Sponsored by: DARPA, NAI Labs