History log of /freebsd-10.3-release/sys/kern/vfs_aio.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 296373 04-Mar-2016 marius

- Copy stable/10@296371 to releng/10.3 in preparation for 10.3-RC1
builds.
- Update newvers.sh to reflect RC1.
- Update __FreeBSD_version to reflect 10.3.
- Update default pkg(8) configuration to use the quarterly branch.

Approved by: re (implicit)

# 282569 06-May-2015 mav

MFC r281860: Make AIO to not allocate pbufs for unmapped I/O like r281825.

While there, make few more performance optimizations.

On 40-core system doing many 512-byte AIO reads from array of raw SSDs
this change removes lock congestions inside pbuf allocator and devfs,
and bottleneck on single AIO completion taskqueue thread. It improves
peak AIO performance from ~600K to ~1.3M IOPS.


# 280258 19-Mar-2015 rwatson

Merge r263233 from HEAD to stable/10:

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

Sponsored by: Google, Inc.


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255241 05-Sep-2013 pjd

The fget() function now takes pointer to cap_rights_t, so change 0 to NULL.


# 255219 05-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 254389 15-Aug-2013 ken

Change the way that unmapped I/O capability is advertised.

The previous method was to set the D_UNMAPPED_IO flag in the cdevsw
for the driver. The problem with this is that in many cases (e.g.
sa(4)) there may be some instances of the driver that can handle
unmapped I/O and some that can't. The isp(4) driver can handle
unmapped I/O, but the esp(4) driver currently cannot. The cdevsw
is shared among all driver instances.

So instead of setting a flag on the cdevsw, set a flag on the cdev.
This allows drivers to indicate support for unmapped I/O on a
per-instance basis.

sys/conf.h: Remove the D_UNMAPPED_IO cdevsw flag and replace it
with an SI_UNMAPPED cdev flag.

kern_physio.c: Look at the cdev SI_UNMAPPED flag to determine
whether or not a particular driver can handle
unmapped I/O.

geom_dev.c: Set the SI_UNMAPPED flag for all GEOM cdevs.
Since GEOM will create a temporary mapping when
needed, setting SI_UNMAPPED unconditionally will
work.

Remove the D_UNMAPPED_IO flag.

nvme_ns.c: Set the SI_UNMAPPED flag on cdevs created here
if NVME_UNMAPPED_BIO_SUPPORT is enabled.

vfs_aio.c: In aio_qphysio(), check the SI_UNMAPPED flag on a
cdev instead of the D_UNMAPPED_IO flag on the cdevsw.

sys/param.h: Bump __FreeBSD_version to 1000045 for the switch from
setting the D_UNMAPPED_IO flag in the cdevsw to setting
SI_UNMAPPED in the cdev.

Reviewed by: kib, jimharris
MFC after: 1 week
Sponsored by: Spectra Logic


# 253885 02-Aug-2013 glebius

Remove extra zeroing after M_ZERO allocation.


# 253527 21-Jul-2013 kib

Move the convert_sigevent32() utility function into freebsd32_misc.c
for consumption outside the vfs_aio.c.

For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods,
also copy in the sigev_value, since librt event pumping loop compares
note generation number with the value passed through sigev_value.

Tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 251526 08-Jun-2013 glebius

Add new system call - aio_mlock(). The name speaks for itself. It allows
to perform the mlock(2) operation, which can consume a lot of time, under
control of aio(4).

Reviewed by: kib, jilles
Sponsored by: Nginx, Inc.


# 251522 08-Jun-2013 glebius

Separate LIO_SYNC processing into a separate function aio_process_sync(),
and rename aio_process() into aio_process_rw().

Reviewed by: kib
Sponsored by: Nginx, Inc.


# 248794 27-Mar-2013 kib

Fix a race with the vnode reclamation in the aio_qphysio(). Obtain
the thread reference on the vp->v_rdev and use the returned struct
cdev *dev instead of using vp->v_rdev. Call dev_strategy_csw()
instead of dev_strategy(), since we now own the reference.

Since the csw was already calculated, test d_flags to avoid mapping
the buffer if the driver supports unmapped requests [*].

Suggested by: kan [*]
Reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 248515 19-Mar-2013 kib

Do not remap usermode pages into KVA for physio.

Sponsored by: The FreeBSD Foundation
Tested by: pho


# 248084 09-Mar-2013 attilio

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 247602 02-Mar-2013 pjd

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# 241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 231006 05-Feb-2012 davidxu

Add 32-bit compat code for AIO kevent flags introduced in revision 230857.


# 230857 01-Feb-2012 davidxu

If multiple threads call kevent() to get AIO events on same kqueue fd,
it is possible that a single AIO event will be reported to multiple
threads, it is not threading friendly, and the existing API can not
control this behavior.
Allocate a kevent flags field sigev_notify_kevent_flags for AIO event
notification in sigevent, and allow user to pass EV_CLEAR, EV_DISPATCH
or EV_ONESHOT to AIO kernel code, user can control whether the event
should be cleared once it is retrieved by a thread. This change should
be comptaible with existing application, because the field should have
already been zero-filled, and no additional action will be taken by
kernel.

PR: kern/156567


# 230778 30-Jan-2012 ambrisko

When detaching an AIO or LIO requests grab the lock and tell knlist_remove
that we have the lock now. This cleans up a locking panic ASSERT when
knlist_empty is called without a lock when INVARIANTS etc. are turned.

Reviewed by: kib jhb
MFC after: 1 week


# 230610 27-Jan-2012 glebius

Fix size check, that prevents getting negative after casting
to a signed type

Reviewed by: bde


# 230583 26-Jan-2012 glebius

Although aio_nbytes is size_t, later is is signed to
casted types: to ssize_t in filesystem code and to
int in buf code, thus supplying a negative argument
leads to kernel panic later. To fix that check user
supplied argument in the beginning of syscall.

Submitted by: Maxim Dounin <mdounin mdounin.ru>, maxim@


# 225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 224778 11-Aug-2011 rwatson

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# 213642 09-Oct-2010 davidxu

Create a global thread hash table to speed up thread lookup, use
rwlock to protect the table. In old code, thread lookup is done with
process lock held, to find a thread, kernel has to iterate through
process and thread list, this is quite inefficient.
With this change, test shows in extreme case performance is
dramatically improved.

Earlier patch was reviewed by: jhb, julian


# 205326 19-Mar-2010 kib

Convert aio syscall registration to SYSCALL_INIT_HELPER.

Reviewed by: jhb
MFC after: 2 weeks


# 205014 11-Mar-2010 nwhitehorn

Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by: kib, jhb


# 197134 12-Sep-2009 rwatson

Use C99 initialization for struct filterops.

Obtained from: Mac OS X
Sponsored by: Apple Inc.
MFC after: 3 weeks


# 193951 10-Jun-2009 kib

Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland


# 193272 01-Jun-2009 jhb

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem


# 187631 23-Jan-2009 jhb

Use the correct type for the timeout parameter to the 32-bit
compat version aio_waitcomplete().

Reminded by: bz
Submitted by: jamie
MFC after: 3 days


# 185878 10-Dec-2008 jhb

- Add 32-bit compat system calls for VFS_AIO. The system calls live in the
aio code and are registered via the recently added SYSCALL32_*() helpers.
- Since the aio code likes to invoke fuword and suword a lot down in the
"bowels" of system calls, add a structure holding a set of operations for
things like storing errors, copying in the aiocb structure, storing
status, etc. The 32-bit system calls use a separate operations vector to
handle fuword32 vs fuword, etc. Also, the oldsigevent handling is now
done by having seperate operation vectors with different aiocb copyin
routines.
- Split out kern_foo() functions for the various AIO system calls so the
32-bit front ends can manage things like copying in and converting
timespec structures, etc.
- For both the native and 32-bit aio_suspend() and lio_listio() calls,
just use copyin() to read the array of aiocb pointers instead of using
a for loop that iterated over fuword/fuword32. The error handling in
the old case was incomplete (lio_listio() just ignored any aiocb's that
it got an EFAULT trying to read rather than reporting an error), and
possibly slower.

MFC after: 1 month


# 179908 21-Jun-2008 gonzo

Use minimum of max_aio_procs and target_aio_procs when spawning new
aiod since there should be no more then max_aio_procs processes.


# 175870 01-Feb-2008 rwatson

Use FEATURE() macro to advertise aio availability.


# 175642 24-Jan-2008 dumbbell

When asked to use kqueue, AIO stores its internal state in the
`kn_sdata' member of the newly registered knote. The problem is that
this member is overwritten by a call to kevent(2) with the EV_ADD flag,
targetted at the same kevent/knote. For instance, a userland application
may set the pointer to NULL, leading to a panic.

A testcase was provided by the submitter.

PR: kern/118911
Submitted by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 1 day


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175202 10-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 172836 20-Oct-2007 julian

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


# 171901 20-Aug-2007 kib

Destroy the kaio_mtx on the freeing the struct kaioinfo in the
aio_proc_rundown.

Do not allow for zero-length read to be passed to the fo_read file method
by aio.

Reported and tested by: Peter Holm
Approved by: re (kensmith)


# 170483 10-Jun-2007 mjacob

Remove unused variable.


# 170174 01-Jun-2007 jeff

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 167232 05-Mar-2007 rwatson

Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.


# 164184 11-Nov-2006 trhodes

Merge posix4/* into normal kernel hierarchy.

Reviewed by: glanced at by jhb
Approved by: silence on -arch@ and -standards@


# 163379 15-Oct-2006 netchild

MFP4 (with some minor changes):

Implement the linux_io_* syscalls (AIO). They are only enabled if the native
AIO code is available (either compiled in to the kernel or as a module) at
the time the functions are used. If the AIO stuff is not available there
will be a ENOSYS.

From the submitter:
---snip---
DESIGN NOTES:

1. Linux permits a process to own multiple AIO queues (distinguished by
"context"), but FreeBSD creates only one single AIO queue per process.
My code maintains a request queue (STAILQ of queue(3)) per "context",
and throws all AIO requests of all contexts owned by a process into
the single FreeBSD per-process AIO queue.

When the process calls io_destroy(2), io_getevents(2), io_submit(2) and
io_cancel(2), my code can pick out requests owned by the specified context
from the single FreeBSD per-process AIO queue according to the per-context
request queues maintained by my code.

2. The request queue maintained by my code stores contrast information between
Linux IO control blocks (struct linux_iocb) and FreeBSD IO control blocks
(struct aiocb). FreeBSD IO control block actually exists in userland memory
space, required by FreeBSD native aio_XXXXXX(2).

3. It is quite troubling that the function io_getevents() of libaio-0.3.105
needs to use Linux-specific "struct aio_ring", which is a partial mirror
of context in user space. I would rather take the address of context in
kernel as the context ID, but the io_getevents() of libaio forces me to
take the address of the "ring" in user space as the context ID.

To my surprise, one comment line in the file "io_getevents.c" of
libaio-0.3.105 reads:

Ben will hate me for this

REFERENCE:

1. Linux kernel source code: http://www.kernel.org/pub/linux/kernel/v2.6/
(include/linux/aio_abi.h, fs/aio.c)

2. Linux manual pages: http://www.kernel.org/pub/linux/docs/manpages/
(io_setup(2), io_destroy(2), io_getevents(2), io_submit(2), io_cancel(2))

3. Linux Scalability Effort: http://lse.sourceforge.net/io/aio.html
The design notes: http://lse.sourceforge.net/io/aionotes.txt

4. The package libaio, both source and binary:
http://rpmfind.net/linux/rpm2html/search.php?query=libaio
Simple transparent interface to Linux AIO system calls.

5. Libaio-oracle: http://oss.oracle.com/projects/libaio-oracle/
POSIX AIO implementation based on Linux AIO system calls (depending on
libaio).
---snip---

Submitted by: Li, Xiao <intron@intron.ac>


# 162594 24-Sep-2006 jmg

hide kqueue_register from public view, and replace it w/ kqfd_register...
this eliminates a possible race in aio registering a kevent..


# 162066 06-Sep-2006 mp

Remove call to fdfree() for the AIO daemons to prevent kernel panics
with linprocfs. This call is not needed since file descriptor sharing
was removed in v1.125.

Reviewed by: alc, davidxu, ambrisko
MFC after: 3 days


# 161302 15-Aug-2006 netchild

- Change process_exec function handlers prototype to include struct
image_params arg.
- Change struct image_params to include struct sysentvec pointer and
initialize it.
- Change all consumers of process_exit/process_exec eventhandlers to
new prototypes (includes splitting up into distinct exec/exit functions).
- Add eventhandler to userret.

Sponsored by: Google SoC 2006
Submitted by: rdivacky
Parts suggested by: jhb (on hackers@)


# 159176 02-Jun-2006 ambrisko

Make lio ident more consistant with aio ident.


# 158373 09-May-2006 davidxu

Use a dedicated mutex to protect aio queues, the movation is to reduce
lock contention with other parts.


# 157073 24-Mar-2006 davidxu

1. Move code for scanning pending I/O from aio_fsync to aio_aqueue,
it has less overhead.
2. Avoid scheduling task if maximum number of I/O threads is reached.


# 157037 23-Mar-2006 davidxu

Implement aio_fsync() syscall.


# 156024 26-Feb-2006 davidxu

1. Remove aio entry from lists earlier in aio_free_entry,
so other threads can not see it if we unlock the proc
lock (this can happen in knlist_delete). Don't do wakeup,
it is not necessary.

2. Decrease kaio_buffer_count in biohelper rather than
doing it in aio_bio_done_notify.

3. In aio_bio_done_notify, don't send notification if KAIO_RUNDOWN
was set, because the process is already in single thread mode.

4. Use assignment to initialize aiothreadflags.

5. AIOCBLIST_RUNDOWN is not useful, axe the code using it.

6. use LIO_NOP instead of zero.


# 155887 22-Feb-2006 davidxu

If block size is zero, use normal file operations to do I/O,
this eliminates a divided-by-zero fault.

Recommended by: phk


# 154893 27-Jan-2006 davidxu

Just like dofilewrite(), call bwillwrite before fo_write.


# 154885 27-Jan-2006 davidxu

return final error code in aio_return rather than a hardcoded 0.


# 154854 26-Jan-2006 davidxu

in aio_aqueue, store same return code into job->_aiocb_private.error.
in aio_return, unlock proc lock before suword.


# 154765 24-Jan-2006 davidxu

Add locking annotation and comments about socket, pipe, fifo problem.
Temporarily fix a locking problem for socket I/O.


# 154745 24-Jan-2006 davidxu

Er, rescure a deleted comment line.


# 154744 24-Jan-2006 davidxu

More cleanup for aio code:
1) unregsiter kqueue filter for EVFILT_LIO.
2) free uma_zones.
3) call setsid directly to enter another session rather than
implementing by itself.

Submitted by: jhb


# 154740 23-Jan-2006 davidxu

Add bracket.


# 154706 23-Jan-2006 davidxu

Verify all supported notification types.


# 154698 23-Jan-2006 davidxu

1) Merge _aio_aqueue and aio_aqueue, check quota in aio_aqueue,
so that lio_listio won't exceed the quota.
2) Remove lio_ref_count, it is no longer used.


# 154672 22-Jan-2006 davidxu

Fix a bogus panic.


# 154671 22-Jan-2006 davidxu

Decrease kaio_active_count first, because user process may go away
after we notified it.


# 154669 22-Jan-2006 davidxu

Make aio code MP safe.


# 154389 15-Jan-2006 csjp

Initialize ki to p->p_aioinfo after we know it's going to be referencing
a valid kaioinfo structure. This avoids a potential NULL pointer dereference.

Found with: Coverity Prevent(tm)
MFC after: 2 weeks


# 154073 06-Jan-2006 jhb

Return error from fget_write() rather than hardcoding EBADF now that
fget_write() DTRT.

Requested by: bde


# 152215 08-Nov-2005 davidxu

In aio_waitcomplete, do not return EAGAIN if no other threads
have started aio, instead, initialize aio management structure
if it hasn't been done, the reason to adjust this behavior is
to make it a bit friendly for threaded program, consider two
threads, one submits aio_write, and another just calls
aio_waitcomplete to wait any I/O to be completed and recycle the
aio requests, before submitter doing any I/O, the recycler wants
to wait in kernel. This also fixes inconsistency with other aio
syscalls.


# 152208 08-Nov-2005 jhb

Various and sundry cleanups:
- Use curthread for calls to knlist_delete() and add a big comment
explaining why as well as appropriate assertions.
- Use TAILQ_FOREACH and TAILQ_FOREACH_SAFE instead of handrolling them.
- Use fget() family of functions to lookup file objects instead of
grovelling around in file descriptor tables.
- Destroy the aio_freeproc mutex if we are unloaded.

Tested on: i386


# 152029 04-Nov-2005 davidxu

Fix name compatible problem with POSIX standard. the sigval_ptr and
sigval_int really should be sival_ptr and sival_int.
Also sigev_notify_function accepts a union sigval value but not a
pointer.


# 151994 03-Nov-2005 davidxu

Support sending realtime signal information via signal queue, realtime
signal memory is pre-allocated, so kernel can always notify user code.


# 151932 01-Nov-2005 jhb

Push down Giant into fdfree() and remove it from two of the callers.
Other callers such as some rfork() cases weren't locking Giant anyway.

Reviewed by: csjp
MFC after: 1 week


# 151867 30-Oct-2005 davidxu

Fix sigevent's POSIX incompatible problem by adding member fields
sigev_notify_function and sigev_notify_attributes. AIO syscalls
use sigevent, so they have to be adjusted.

Reviewed by: alc


# 151273 12-Oct-2005 ambrisko

Fix tinderbox box by removing incomplete/bad spl usage. Proper giant free
locking is required in for aio.

Pointed out by: imp


# 151260 12-Oct-2005 ambrisko

Add in kqueue support to LIO event notification and fix how it handled
notifications when LIO operations completed. These were the problems
with LIO event complete notification:
- Move all LIO/AIO event notification into one general function
so we don't have bugs in different data paths. This unification
got rid of several notification bugs one of which if kqueue was
used a SIGILL could get sent to the process.
- Change the LIO event accounting to count all AIO request that
could have been split across the fast path and daemon mode.
The prior accounting only kept track of AIO op's in that
mode and not the entire list of operations. This could cause
a bogus LIO event complete notification to occur when all of
the fast path AIO op's completed and not the AIO op's that
ended up queued for the daemon.

Suggestions from: alc


# 148200 20-Jul-2005 alc

Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks


# 147730 01-Jul-2005 ssouhlal

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# 147092 07-Jun-2005 alc

In lio_listio(2) change jobref from an int to a long so that
lio_listio(LIO_WAIT, ...) works correctly on 64-bit architectures.

Reviewed by: tegge


# 146993 05-Jun-2005 alc

Eliminate an unused field from struct aio_liojob.


# 146963 04-Jun-2005 alc

Eliminate the original method of requesting notification of aio_read(2) and
aio_write(2) completion through kevent(2). This method does not work on
64-bit architectures. It was deprecated in FreeBSD 4.4. See revisions
1.87 and 1.70.2.7.

Change aio_physwakeup() to call psignal(9) directly rather than indirectly
through a timeout(9). Discussed with: bde

Correct a bug introduced in revision 1.65 that could result in premature
delivery of a signal if an lio_listio(2) consisted of a mixture of
direct/raw and queued I/O operations. Observed by: tegge

Eliminate a field from struct kaioinfo that is now unused.

Reviewed by: tegge


# 146914 03-Jun-2005 alc

Synchronize access to the per process aiocb lists in many of the functions.


# 146892 02-Jun-2005 alc

In aio_waitcomplete() correct two cases of using an aiocb after freeing it.


# 146815 30-May-2005 alc

Synchronize access to aio_freeproc with a mutex. Eliminate related spl
calls.

Reduce the scope of Giant in aio_daemon().


# 146813 30-May-2005 alc

Use the proc mtx to prevent simultaneous changes to p_aioinfo.


# 146809 30-May-2005 alc

Eliminate unnecessary calls to wakeup(); no one sleeps on &aio_freeproc.

Eliminate an unused flag, AIOP_SCHED; it's cleared but never set.


# 146798 30-May-2005 alc

Eliminate aio_activeproc; it's unused.


# 146787 29-May-2005 alc

Eliminate aio_bufjobs; it's unused.


# 145701 30-Apr-2005 jeff

- Acquire Giant in AIO's iodone routine. VFS will no longer do it for us
soon.

Sponsored by: Isilon Systems, Inc.


# 143776 18-Mar-2005 jmg

fix aio+kq... I've been running ambrisko's test program for much longer
w/o problems than I was before... This simply brings back the knote_delete
as knlist_delete which will also drop the knote's, instead of just clearing
the list and seeing _ONESHOT...

Fix a race where if a note was _INFLUX and _DETACHED, it could end up being
modified... whoopse..

MFC after: 1 week
Prodded by: ambrisko and dwhite


# 141633 10-Feb-2005 phk

Make a SYSCTL_NODE static


# 139804 06-Jan-2005 imp

/* -> /*- for copyright notices, minor format tweaks as necessary


# 137186 04-Nov-2004 phk

Remove buf->b_dev field.


# 137029 29-Oct-2004 phk

Give dev_strategy() an explict cdev argument in preparation for removing
buf->b-dev.

Put a bio between the buf passed to dev_strategy() and the device driver
strategy routine in order to not clobber fields in the buf.

Assert copyright on vfs_bio.c and update copyright message to canonical
text. There is no legal difference between John Dysons two-clause
abbreviated BSD license and the canonical text.


# 136966 26-Oct-2004 phk

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.


# 136007 01-Oct-2004 alfred

cover soreadable and sowriteable with the corresponding socketbuffer locks.


# 135617 23-Sep-2004 phk

Eliminate DEV_STRATEGY() macro: call dev_strategy() directly.

Make dev_strategy() handle errors and departing devices properly.


# 134670 03-Sep-2004 rwatson

Tag AIO as requiring Giant over the network stack using
NET_NEEDS_GIANT().

RELENG_5 candidate.


# 133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 133660 13-Aug-2004 jmg

clean up whitespace...


# 132684 27-Jul-2004 alc

- Use atomic ops for updating the vmspace's refcnt and exitingcnt.
- Push down Giant into shmexit(). (Giant is acquired only if the vmspace
contains shm segments.)
- Eliminate the acquisition of Giant from proc_rwmem().
- Reduce the scope of Giant in exit1(), uncovering the destruction of the
address space.


# 130653 17-Jun-2004 rwatson

Merge additional socket buffer locking from rwatson_netperf:

- Lock down low hanging fruit use of sb_flags with socket buffer
lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
grab the receive socket buffer lock, which are currently actually
the same lock. Depending on how we want to play our cards, we
may want to coallesce these lock uses to reduce overhead.

- Convert a if()->panic() into a KASSERT relating to so_state in
soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.


# 129882 30-May-2004 phk

Add missing #include <sys/module.h>


# 126941 14-Mar-2004 peter

Make the process_exit eventhandler run without Giant. Add Giant hooks
in the two consumers that need it.. processes using AIO and netncp.
Update docs. Say that process_exec is called with Giant, but not to
depend on it. All our consumers can handle it without Giant.


# 122747 15-Nov-2003 phk

Send B_PHYS out to pasture, it no longer serves any function.


# 121483 24-Oct-2003 jmg

don't allow reading from files that haven't been open'd for reading.


# 121303 21-Oct-2003 simokawa

We need to initialize bp->b_offset and bp->b_iooffset
becuase bp->b_blkno is ignored now.


# 119956 10-Sep-2003 simokawa

Fix asynchronous physio breakage introduced in rev 1.163.
We cannnot use bp->b_caller2 because DEV_STRATEGY will overwrite it.


# 116678 22-Jun-2003 phk

Add a f_vnode field to struct file.

Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.


# 116430 16-Jun-2003 phk

Get rid of the b_spc specialty field in struct buf by using an already
available caller private field.


# 116182 11-Jun-2003 obrien

Use __FBSDID().


# 114216 29-Apr-2003 kan

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 113638 17-Apr-2003 jhb

- kthread's don't have p_textvp set to anything, so replace code that
dealt with that possibility with a KASSERT().
- No need to set P_SYSTEM, kthread_create() does that for us.


# 113149 05-Apr-2003 alc

Don't reinitialize fields that are already initialized by getpbuf().


# 113049 04-Apr-2003 alc

o Remove useracc() calls from aio_qphysio(); they are redundant
given the checks performed by vmapbuf().

Reviewed by: tegge


# 112564 24-Mar-2003 jhb

Replace the at_fork, at_exec, and at_exit functions with the slightly more
flexible process_fork, process_exec, and process_exit eventhandlers. This
reduces code duplication and also means that I don't have to go duplicate
the eventhandler locking three more times for each of at_fork, at_exec, and
at_exit.

Reviewed by: phk, jake, almost complete silence on arch@


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109572 20-Jan-2003 dillon

Close the remaining user address mapping races for physical
I/O, CAM, and AIO. Still TODO: streamline useracc() checks.

Reviewed by: alc, tegge
MFC after: 7 days


# 109177 13-Jan-2003 alfred

style(9) fixes, mostly add parens around return arguments.


# 109153 13-Jan-2003 dillon

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# 109137 12-Jan-2003 tjr

Allowing nent < 0 in aio_suspend() and lio_listio() is just asking for
trouble. Return EINVAL instead.


# 109136 12-Jan-2003 tjr

Remove "XXX undocumented" comment from lio_listio().


# 109123 12-Jan-2003 dillon

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# 108586 03-Jan-2003 phk

Remove unused second argument from DEV_STRATEGY().


# 108311 27-Dec-2002 alfred

Lock filedesc while performing a range check on the file descriptor.

Reviewed by: alc


# 106998 17-Nov-2002 alfred

Rework the sysconf(3) interaction with aio:

sysconf.c:
Use 'break' rather than 'goto yesno' in sysconf.c so that we report a '0'
return value from the kernel sysctl.

vfs_aio.c:
Make aio reset its configuration parameters to -1 after unloading
instead of 0.

posix4_mib.c:
Initialize the aio configuration parameters to -1
to indicate that it is not loaded.
Add a facility (p31b_iscfg()) to determine if a posix4 facility has been
initialized to avoid having to re-order the SYSINITs.
Use p31b_iscfg() to determine if aio has had a chance to run yet which
is likely if it is compiled into the kernel and avoid spamming its
values.
Introduce a macro P31B_VALID() instead of doing the same comparison over
and over.

posix4.h:
Prototype p31b_iscfg().


# 106979 16-Nov-2002 alfred

Export the values for _SC_AIO_MAX and _SC_AIO_PRIO_DELTA_MAX via the p1003b
sysctl interface.


# 106975 16-Nov-2002 alfred

Call 'p31b_setcfg(CTL_P1003_1B_AIO_LISTIO_MAX, AIO_LISTIO_MAX)'
when AIO is initialized so that sysconf() gives correct results.

Reported by: Craig Rodrigues <rodrigc@attbi.com>


# 106574 07-Nov-2002 rwatson

Do a bit more work in the aio code to simulate the credential environment
of the original AIO request: save and restore the active thread credential
as well as using the file credential, since MAC (and some other bits of
the system) rely on the thread credential instead of/as well as the
file credential. In brief: cache td->td_ucred when the AIO operation
is queued, temporarily set and restore the kernel thread credential,
and release the credential when done. Similar to ktrace credential
management.

Reviewed by: alc
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 106057 27-Oct-2002 wollman

Change the way support for asynchronous I/O is indicated to applications
to conform to 1003.1-2001. Make it possible for applications to actually
tell whether or not asynchronous I/O is supported.

Since FreeBSD's aio implementation works on all descriptor types, don't
call down into file or vnode ops when [f]pathconf() is asked about
_PC_ASYNC_IO; this avoids the need for every file and vnode op to know about
it.


# 105409 18-Oct-2002 jhb

fdfree() clears p_fd for us, no need to do it again.


# 105073 14-Oct-2002 alc

Eliminate the unnecessary clearing of flag bits that are already clear
in lio_listio(2).


# 104354 02-Oct-2002 scottl

Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
who's size can be speficied when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.

Reviewed by: jake, peter, jhb


# 102291 22-Aug-2002 archie

Replace (ab)uses of "NULL" where "0" is really meant.


# 102257 22-Aug-2002 alc

o Remove the AIOCBLIST_ASYNCFREE flag and related code. It's never set.

Submitted by: Romer Gil <rgil@cs.rice.edu>


# 101694 11-Aug-2002 alc

o Make a correction to the last change: In aio_cancel(2) return AIO_ALLDONE
instead of EINVAL if p->p_aioinfo is NULL.


# 101663 11-Aug-2002 alc

o In aio_cancel(2), make sure that p->p_aioinfo isn't NULL before
dereferencing it.

Submitted by: saureen <sshah@apple.com>


# 101437 06-Aug-2002 alc

Set the ident field of the struct kevent that is registered by _aio_aqueue()
to the address of the user's aiocb rather than the kernel's aiocb. (In other
words, prior to this change, the ident field returned by kevent(2) on
completion of an AIO was effectively garbage.)

Submitted by: Romer Gil <rgil@cs.rice.edu>


# 101402 05-Aug-2002 alc

o The introduction of kevent() broke lio_listio(): _aio_aqueue() thought
that LIO_READ and LIO_WRITE were requests for kevent()-based
notification of completion. Modify _aio_aqueue() to recognize LIO_READ
and LIO_WRITE.

Notes: (1) The patch provided by the PR perpetuates a second bug in this
code, a direct access to user-space memory. This change fixes that bug
as well. (2) This change is to code that implements a deprecated interface.
It should probably be removed after an MFC.

PR: kern/39556


# 97658 31-May-2002 tanimura

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 97287 25-May-2002 alc

o Remove some unnecessary casting from and add some necessary casting to
aio_suspend() and lio_listio().

Submitted by: bde


# 97219 24-May-2002 peter

Fix warnings. Also, removed an unused variable that I found that was just
initialized and never used afterwards.


# 96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 95973 03-May-2002 tanimura

As malloc(9) and free(9) are now Giant-free, remove the Giant lock
across malloc(9) and free(9) of a pgrp or a session.


# 95123 20-Apr-2002 tanimura

Push down Giant for setpgid(), setsid() and aio_daemon(). Giant protects only
malloc(9) and free(9).


# 94861 16-Apr-2002 jhb

Lock proctree_lock instead of pgrpsess_lock.


# 94633 14-Apr-2002 alc

o Use aiocblist::fd_file in the AIO threads rather than recomputing
the file * from the calling process's descriptor table.
o Eliminate sharing of the calling process's descriptor table
with the AIO threads.


# 94162 08-Apr-2002 alc

Restructure aio_return() to eliminate duplicated code and facilitate Giant
push down.


# 94035 07-Apr-2002 alc

Reduce the duplication of code for error handling in _aio_aqueue().


# 94010 07-Apr-2002 alc

Change jobref and *ijoblist from int to long in order to avoid
a catastrophe after the 2^32nd AIO operation on 64-bit architectures.


# 93745 04-Apr-2002 alc

o aio_process needn't fhold()/fdrop() the fp now that _aio_aqueue() and
aio_free_entry() do this.
o Remove two unnecessary/unused variables from aio_process() and one field
from aiocblist.


# 93483 31-Mar-2002 alc

Keep the reference to the file acquired in _aio_aqueue() until the operation
completes. The reference is released in aio_free_entry().

Submitted by: tegge


# 93183 25-Mar-2002 bde

Added used include of <sys/sx.h>. Don't depend on namespace pollution in
<sys/file.h> or <sys/socketvar.h>.


# 92751 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 91690 05-Mar-2002 eivind

Document all functions, global and static variables, and sysctls.
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)

Reviewed by: phk (but all errors are mine)


# 91140 23-Feb-2002 tanimura

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# 90576 12-Feb-2002 alc

o Clearing p/td_retval[0] after aio_newproc() is unnecessary. (We stopped
calling rfork() to create aio threads in revision 1.46.)
o Don't recompute the FILE * when it's already stored in the kernel's AIOCB.


# 90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 89588 20-Jan-2002 alc

o Remove the unused vestiges of JOBST_JOBQPROC and
the per-thread jobtorun queue.
o Use TAILQ_EMPTY() instead of TAILQ_FIRST(...) == NULL.


# 89574 20-Jan-2002 alc

o Revision 1.99 ("KSE Milestone 2") left the aio daemons
sleeping on a process object but changed the corresponding
wakeup()s to the thread object. The result was that non-raw
aio ops waited for an aio daemon to timeout before action
was taken. Now, we sleep on the thread object.

PR: kern/34016


# 89465 17-Jan-2002 alc

o Eliminate an unused parameter from aio_fphysio().


# 89332 14-Jan-2002 alc

o Correct the initialization of aiolio_zone: Each entry was 16 times larger
than necessary.
o Move a rarely-used goto label inside a critical section so that we don't
perform an splnet() for which there is no corresponding splx().
o Remove unnecessary splnet()/splx() around accesses to kaioinfo::kaio_jobdone
in aio_return().
o Use TAILQ_FOREACH for simple cases of iteration over kaioinfo::kaio_jobdone.


# 89120 09-Jan-2002 alc

o Correct a 32/64-bit error in the initialization of aiol_zone, specifically,
sizeof(int) is not the size of a pointer.


# 88970 06-Jan-2002 alc

o Add missing synchronization (splnet()/splx()) in aio_free_entry().
o Move the definition of struct aiocblist from sys/aio.h to kern/vfs_aio.c.
o Make aio_swake_cb() static.


# 88803 02-Jan-2002 alc

o Properly check the file descriptor passed to aio_cancel(2). (Previously,
no out-of-bounds check was performed on the file descriptor.)
o Eliminate some excessive white space from aio_cancel(2).


# 88752 01-Jan-2002 alc

o Some style(9)-motivated changes to white space.


# 88730 31-Dec-2001 alc

o Correct an off-by-one error in aio_suspend(2).

PR: 18350


# 88729 31-Dec-2001 alc

o Use "td->td_proc" instead of "curproc" where possible.
o Eliminate the unnecessary initialization of several static variables
to zero.


# 88633 29-Dec-2001 alfred

Make AIO a loadable module.

Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exec(9) and rm_at_exec(9), these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's archetecuterally neutral,
the problem was that one had to pass it a paramater indicating the
number of arguments which were actually the number of "int". Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)

Add a primative system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.


# 87590 10-Dec-2001 alc

o Eliminate compilation warnings on 64-bit architectures.


# 87556 09-Dec-2001 alc

o Eliminate unnecessary synchronization from filt_aiodetach().
o The manual page for kevent says that EVFILT_AIO returns under the same
conditions as aio_error(). With that in mind, set the data field
of the returned struct kevent to the value that would be returned
by aio_error().
o Fix two compilation warnings.


# 84551 05-Oct-2001 jhb

The aio kthreads start off with a root credential just like all other
kthreads, so don't malloc a ucred just so we can create a duplicate of the
one we already have.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 75682 18-Apr-2001 alfred

Check validity of signal callback requested via aio routines.

Also move the insertion of the request to after the request is validated,
there's still looks like there may be some problems if an invalid address
is passed to the aio routines, basically a possible leak or having a
not completely initialized structure on the queue may still be possible.

A new sig macro was made _SIG_VALID to check the validity of a signal,
it would be advisable to use it from now on (in kern/kern_sig.c) rather
than rolling your own.

PR: kern/17152


# 74089 10-Mar-2001 alc

When aio_read/write() is used on a raw device, physical buffers are
used for up to "vfs.aio.max_buf_aio" of the requests. If a request
size is MAXPHYS, but the request base isn't page aligned, vmapbuf()
will map the end of the user space buffer into the start of the kva
allocated for the next physical buffer. Don't use a physical buffer
in this case. (This change addresses problem report 25617.)

When an aio_read/write() on a raw device has completed, timeout() is
used to schedule a signal to the process. Thus, the reporting is
delayed up to 10 ms (assuming hz is 100). The process might have
terminated in the meantime, causing a trap 12 when attempting to
deliver the signal. Thus, the timeout must be cancelled when removing
the job.

aio jobs in state JOBST_JOBQGLOBAL should be removed from the
kaio_jobqueue list during process rundown.

During process rundown, some aio jobs might move from one list to a
different list that has already been "emptied", causing the rundown to
be incomplete. Retry the rundown.

A call to BUF_KERNPROC() is needed after obtaining a physical buffer
to disassociate the lock from the running process since it can return
to userland without releasing that lock.

PR: 25617
Submitted by: tegge


# 74015 09-Mar-2001 alc

Use the kthread API to create and destroy AIO daemons.

Submitted by: jhb


# 73929 07-Mar-2001 jhb

Grab the process lock while calling psignal and before calling psignal.


# 73883 06-Mar-2001 alc

Add a missing splx() to aio_fphysio(). (This change is a no-op in -5.0,
but potentially significant in -4.x.)

Eliminate a pointless parameter to aio_fphysio().

Remove unnecessary casts from aio_fphysio() and aio_physwakeup().


# 73559 05-Mar-2001 alc

Eliminate the aio_freejobs list. Its purpose was to store free
aiocb's allocated by zalloc(). In other words, zfree() was never
called. Now, we call zfree(). Why eliminate this micro-
optimization? At some later point, when we multithread the AIO
system, we would need a mutex to synchronize access to aio_freejobs,
making its use nearly indistinguishable in cost from zalloc() and
zfree().

Remove unnecessary fhold() and fdrop() calls from aio_qphysio(),
undo'ing a part of revision 1.86. The reference count on the file
structure is already incremented by _aio_aqueue() before it calls
aio_qphysio(). (Update the comments to document this fact.)

Remove unnecessary casts from _aio_aqueue(), aio_read(), aio_write()
and aio_waitcomplete().

Remove an unnecessary "return;" from aio_process().

Add "static" in various places.


# 73398 04-Mar-2001 alc

Remove the field privatemodes from struct __aiocb_private and the
related code from aio_read() and aio_write(). This field was
intended, but never used, to allow a mythical user-level library to
make an aio_read() or aio_write() behave like an ordinary read() or
write(), i.e., a blocking I/O operation.


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 72082 06-Feb-2001 asmodai

Fix typo: wierd -> weird.

There is no such thing as wierd in the english language.


# 72012 04-Feb-2001 phk

Another round of the <sys/queue.h> FOREACH transmogriffer.

Created with: sed(1)
Reviewed by: md5(1)


# 69512 02-Dec-2000 jake

Remove thr_sleep and thr_wakeup. Remove fields p_nthread and p_wakeup
from struct proc, which are now unused (p_nthread already was).
Remove process flag P_KTHREADP which was untested and only set
in vfs_aio.c (it should use kthread_create). Move the yield
system call to kern_synch.c as kern_threads.c has been removed
completely.

moral support from: alfred, jhb


# 69002 21-Nov-2000 alc

Provide a new interface for the user of aio_read() and aio_write() to request
a kevent upon completion of the I/O. Specifically, introduce a new type
of sigevent notification, SIGEV_EVENT. If sigev_notify is SIGEV_EVENT,
then sigev_notify_kqueue names the kqueue that should receive the event
and sigev_value contains the "void *" is copied into the kevent's udata
field.

In contrast to the existing interface, this one: 1) works on
the Alpha 2) avoids the extra copyin() call for the kevent because all
of the information needed is in the sigevent and 3) could be
applied to request a single kevent upon completion of an entire lio_listio().

Reviewed by: jlemon


# 68883 18-Nov-2000 dillon

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 67907 29-Oct-2000 alc

_aio_aqueue(): Change kevent registration to use its own struct file pointer.
Otherwise, aio_read() and aio_write() on sockets are broken if a kevent is
registered. (The code after kevent registration for handling sockets assumes
that the struct file pointer "fp" still refers to the socket, not the kqueue.)


# 67365 20-Oct-2000 jhb

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 66381 26-Sep-2000 alc

aio_qphysio: Eliminate one instance of an out-of-range check that is
performed twice. Eliminate initialization that is already performed
by _aio_aqueue.

aio_physwakeup: Eliminate redundant synchronization that is already
performed by bufdone.


# 65980 17-Sep-2000 bde

Added used include of <sys/mutex.h> (don't depend on pollution in
<sys/signalvar.h>).


# 65711 11-Sep-2000 jhb

aio processes need to have the Giant mutex before doing work.

Submitted by: tegge


# 65495 05-Sep-2000 truckman

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# 65430 04-Sep-2000 alc

Make filt_aio() check the jobstate for JOBST_JOBBFINISHED (in addition
to JOBST_JOBFINISHED) in case the aio_read() or aio_write() was performed
via the high-performance physio method, i.e., aio_qphysio().


# 63983 28-Jul-2000 peter

Fix the #ifdef VFS_AIO to not compile a whole bunch of unused stuff in the
!VFS_AIO case. Lots of things have hooks into here (kqueue, exit(),
sockets, etc), I elected to keep the external interfaces the same
rather than spread more #ifdefs around the kernel.


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 59288 16-Apr-2000 jlemon

Introduce kqueue() and kevent(), a kernel event notification facility.


# 58934 02-Apr-2000 phk

Move B_ERROR flag to b_ioflags and call it BIO_ERROR.

(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.


# 58349 20-Mar-2000 phk

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# 58345 20-Mar-2000 phk

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.


# 57395 23-Feb-2000 jasone

Add the VFS_AIO config option and leave it off by default. Unless the
VFS_AIO option is specified, all aio-related syscalls return ENOSYS.

The aio code is very fragile right now, and is unsuitable for default
inclusion in a production shell box.

Approved by: jkh


# 56322 20-Jan-2000 jasone

Back out the previous spl change, since it opens a race window.

Reviewed by: alfred, dillon, peter


# 56306 20-Jan-2000 jasone

Don't tsleep() while at splbio().

Correctly return EINPROGRESS from aio_error() even when an aio request
is still in the socket queue.

Submitted by: Adrian Chadd <adrian@bofh.co.uk>


# 56203 17-Jan-2000 green

Fix vn_isdisk() usage to make AIO work on non-disk-files again, rather
than just return ENOTBLK.

PR: 16163
Submitted by: Adrian Chadd <adrian@FreeBSD.org>


# 55943 14-Jan-2000 jasone

Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>


# 55756 10-Jan-2000 phk

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 53577 22-Nov-1999 phk

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 52969 07-Nov-1999 phk

Simplify and de-bogotify check for raw disk.


# 52644 30-Oct-1999 phk

Change useracc() and kernacc() to use VM_PROT_{READ|WRITE|EXECUTE} for the
"rw" argument, rather than hijacking B_{READ|WRITE}.

Fix two bugs (physio & cam) resulting by the confusion caused by this.

Submitted by: Tor.Egge@fast.no
Reviewed by: alc, ken (partly)


# 52128 11-Oct-1999 peter

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# 51418 19-Sep-1999 green

This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter


# 50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49771 14-Aug-1999 phk

Spring cleaning around strategy and disklabels/slices:

Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
please see comment in sys/conf.h about the flag argument.

Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.

Remove bogus and unused strategy1 routines.

Remove open/close arguments from dssize(). Pick them up from dev_t.

Remove unused and unfinished setgeom support from diskslice/label/bad144 code.


# 49678 13-Aug-1999 phk

s/v_specinfo/v_rdev/


# 49535 08-Aug-1999 phk

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 48391 01-Jul-1999 peter

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.


# 48377 30-Jun-1999 peter

Slight tweak to fork1() calling conventions. Add a third argument so
the caller can easily find the child proc struct. fork(), rfork() etc
syscalls set p->p_retval[] themselves. Simplify the SYSINIT_KT() code
and other kernel thread creators to not need to use pfind() to find the
child based on the pid. While here, partly tidy up some of the fork1()
code for RF_SIGSHARE etc.


# 48225 26-Jun-1999 mckusick

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# 47680 01-Jun-1999 phk

Introduce the makebdev() function, it does the same as the makedev()
function for now, but that will change.


# 46796 09-May-1999 phk

major(something) can never become NODEV.


# 46676 08-May-1999 phk

I got tired of seeing all the cdevsw[major(foo)] all over the place.

Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.


# 46580 06-May-1999 phk

remove b_proc from struct buf, it's (now) unused.

Reviewed by: dillon, bde


# 46571 06-May-1999 peter

Fix up a few easy 'assignment used as truth value' and 'suggest parens
around && within ||' type warnings. I'm pretty sure I have not masked
any problems here, I've committed real problem fixes seperately.


# 46129 28-Apr-1999 luoqi

Enable vmspace sharing on SMP. Major changes are,
- %fs register is added to trapframe and saved/restored upon kernel entry/exit.
- Per-cpu pages are no longer mapped at the same virtual address.
- Each cpu now has a separate gdt selector table. A new segment selector
is added to point to per-cpu pages, per-cpu global variables are now
accessed through this new selector (%fs). The selectors in gdt table are
rearranged for cache line optimization.
- fask_vfork is now on as default for both UP and SMP.
- Some aio code cleanup.

Reviewed by: Alan Cox <alc@cs.rice.edu>
John Dyson <dyson@iquest.net>
Julian Elischer <julian@whistel.com>
Bruce Evans <bde@zeta.org.au>
David Greenman <dg@root.com>


# 45311 04-Apr-1999 dt

Add standard padding argument to pread and pwrite syscall. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.


# 44272 25-Feb-1999 bde

Added a used #include (don't depend on "vnode_if.h" including <sys/buf.h>).


# 44146 19-Feb-1999 luoqi

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>


# 43403 29-Jan-1999 dillon

More const fixes for -Wall, -Wcast-qual


# 43386 29-Jan-1999 bde

Removed bogus casts to c_caddr_t. This is part of terminating
c_caddr_t with extreme prejudice. Here the original casts to
caddr_t were to support K&R compilers (or missing prototypes),
but the relevant source files require an ANSI compiler.


# 43351 28-Jan-1999 dillon

Fix warnings related to -Wall -Wcast-qual


# 43311 28-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 43301 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 41827 15-Dec-1998 des

Wrap two macros into do { ... } while (0), and fix the way they're used
in the kernel.

Reviewed by: bde


# 41370 27-Nov-1998 tegge

Don't forget to update the pmap associated with aio daemons when adding
new page directory entries for a growing kernel virtual address space.


# 40648 25-Oct-1998 phk

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 38402 17-Aug-1998 bde

Fixed nonsense overflow checking (checking that a long variable is less
than INT_MAX after it has possibly overflowed).

Removed an unused variable and its associated 2 style bugs.

Removed unused includes.


# 37673 15-Jul-1998 bde

Cast between longs and pointers via intptr_t. There shouldn't be
nearly so many casts here. Casting an pointer that was an integer
back to an integer just to compare it with -1 is bad, and casting
it back just to compare it with NULL is just wrong.


# 37406 05-Jul-1998 julian

fix braino from yesterdays' megacommit

Not sure of the result of it..
(may or may not effect anything) but it's fixed now.
(found by: comparing what cvsup sent back to me with what I tested..)


# 37389 04-Jul-1998 julian

There is no such thing any more as "struct bdevsw".

There is only cdevsw (which should be renamed in a later edit to deventry
or something). cdevsw contains the union of what were in both bdevsw an
cdevsw entries. The bdevsw[] table stiff exists and is a second pointer
to the cdevsw entry of the device. it's major is in d_bmaj rather than
d_maj. some cleanup still to happen (e.g. dsopen now gets two pointers
to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).

rawread()/rawwrite() went away as part of this though it's not strictly
the same patch, just that it involves all the same lines in the drivers.

cdroms no longer have write() entries (they did have rawwrite (?)).
tapes no longer have support for bdev operations.

Reviewed by: Eivind Eklund and Mike Smith
Changes suggested by eivind.


# 36848 10-Jun-1998 dfr

64bit fixes: don't cast pointers to int.


# 35256 17-Apr-1998 des

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


# 34961 30-Mar-1998 phk

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 34925 28-Mar-1998 dufault

Finish _POSIX_PRIORITY_SCHEDULING. Needs P1003_1B and
_KPOSIX_PRIORITY_SCHEDULING options to work. Changes:

Change all "posix4" to "p1003_1b". Misnamed files are left
as "posix4" until I'm told if I can simply delete them and add
new ones;

Add _POSIX_PRIORITY_SCHEDULING system calls for FreeBSD and Linux;

Add man pages for _POSIX_PRIORITY_SCHEDULING system calls;

Add options to LINT;

Minor fixes to P1003_1B code during testing.


# 34924 28-Mar-1998 bde

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 33827 25-Feb-1998 bde

Removed a stale comment and staler code.


# 33181 09-Feb-1998 eivind

Staticize.


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 31661 10-Dec-1997 dyson

Quiet some lint.


# 31620 08-Dec-1997 dyson

Correct prototypes to match POSIX. Correct return code for aio_cancel.
Submitted by: Alex Nash <nash@mcs.com>


# 31477 01-Dec-1997 dyson

Fix a problem when creating a new kernel thread. In some cases, aio_read
or aio_write can return the pid of the new thread. This is due to the
way that return values from system calls being passed by side-effect in
the proc structure now. This commit fixes the problem with aio_read and
aio_write.


# 31473 01-Dec-1997 dyson

Fix error handling for VCHR type I/O. Also, fix another spl problem, and
remove alot of overly verbose debugging statements.
ioproclist {
int aioprocflags; /* AIO proc flags */
TAILQ_ENTRY(aioproclist) list; /* List of processes */
struct proc *aioproc; /* The AIO thread */
TAILQ_HEAD (,aiocblist) jobtorun; /* suggested job to run */
};

/*
* data-structure for lio signal management
*/
struct aio_liojob {
int lioj_flags;
int lioj_buffer_count;
int lioj_buffer_finished_count;
int lioj_queue_count;
int lioj_queue_finished_count;
struct sigevent lioj_signal; /* signal on all I/O done */
TAILQ_ENTRY (aio_liojob) lioj_list;
struct kaioinfo *lioj_ki;
};
#define LIOJ_SIGNAL 0x1 /* signal on all done (lio) */
#define LIOJ_SIGNAL_POSTED 0x2 /* signal has been posted */

/*
* per process aio data structure
*/
struct kaioinfo {
int kaio_flags; /* per process kaio flags */
int kaio_maxactive_count; /* maximum number of AIOs */
int kaio_active_count; /* number of currently used AIOs */
int kaio_qallowed_count; /* maxiumu size of AIO queue */
int kaio_queue_count; /* size of AIO queue */
int kaio_ballowed_count; /* maximum number of buffers */
int kaio_queue_finished_count; /* number of daemon jobs finished */
int kaio_buffer_count; /* number of physio buffers */
int kaio_buffer_finished_count; /* count of I/O done */
struct proc *kaio_p; /* process that uses this kaio block */
TAILQ_HEAD (,aio_liojob) kaio_liojoblist; /* list of lio jobs */
TAILQ_HEAD (,aiocblist) kaio_jobqueue; /* job queue for process */
TAILQ_HEAD (,aiocblist) kaio_jobdone; /* done queue for process */
TAILQ_HEAD (,aiocblist) kaio_bufqueue; /* buffer job queue for process */
TAILQ_HEAD (,aiocblist) kaio_bufdone; /* buffer done queue for process */
};

#define KAIO_RUNDOWN 0x1 /* process is being run down */
#define KAIO_WAKEUP 0x2 /* wakeup process when there is a significant
event */


TAILQ_HEAD (,aioproclist) aio_freeproc, aio_activeproc;
TAILQ_HEAD(,aiocblist) aio_jobs; /* Async job list */
TAILQ_HEAD(,aiocblist) aio_bufjobs; /* Phys I/O job list */
TAILQ_HEAD(,aiocblist) aio_freejobs; /* Pool of free jobs */

static void aio_init_aioinfo(struct proc *p) ;
static void aio_onceonly(void *) ;
static int aio_free_entry(struct aiocblist *aiocbe);
static void aio_process(struct aiocblist *aiocbe);
static int aio_newproc(void) ;
static int aio_aqueue(struct proc *p, struct aiocb *job, int type) ;
static void aio_physwakeup(struct buf *bp);
static int aio_fphysio(struct proc *p, struct aiocblist *aiocbe, int type);
static int aio_qphysio(struct proc *p, struct aiocblist *iocb);
static void aio_daemon(void *uproc);

SYSINIT(aio, SI_SUB_VFS, SI_ORDER_ANY, aio_onceonly, NULL);

static vm_zone_t kaio_zone=0, aiop_zone=0,
aiocb_zone=0, aiol_zone=0, aiolio_zone=0;

/*
* Single AIOD vmspace shared amongst all of them
*/
static struct vmspace *aiovmspace = NULL;

/*
* Startup initialization
*/
void
aio_onceonly(void *na)
{
TAILQ_INIT(&aio_freeproc);
TAILQ_INIT(&aio_activeproc);
TAILQ_INIT(&aio_jobs);
TAILQ_INIT(&aio_bufjobs);
TAILQ_INIT(&aio_freejobs);
kaio_zone = zinit("AIO", sizeof (struct kaioinfo), 0, 0, 1);
aiop_zone = zinit("AIOP", sizeof (struct aioproclist), 0, 0, 1);
aiocb_zone = zinit("AIOCB", sizeof (struct aiocblist), 0, 0, 1);
aiol_zone = zinit("AIOL", AIO_LISTIO_MAX * sizeof (int), 0, 0, 1);
aiolio_zone = zinit("AIOLIO",
AIO_LISTIO_MAX * sizeof (struct aio_liojob), 0, 0, 1);
aiod_timeout = AIOD_TIMEOUT_DEFAULT;
aiod_lifetime = AIOD_LIFETIME_DEFAULT;
jobrefid = 1;
}

/*
* Init the per-process aioinfo structure.
* The aioinfo limits are set per-process for user limit (resource) management.
*/
void
aio_init_aioinfo(struct proc *p)
{
struct kaioinfo *ki;
if (p->p_aioinfo == NULL) {
ki = zalloc(kaio_zone);
p->p_aioinfo = ki


# 31465 30-Nov-1997 dyson

Correct a last minute code change. Would have been an infinite loop under
certain error conditions.
Submitted by: pst@shockwave.com


# 31464 30-Nov-1997 dyson

Fix an spl nit.


# 31456 30-Nov-1997 dyson

Finish up the vast majority of the AIO/LIO functionality. Proper signal
support was missing in the previous version of the AIO code. More
tunables added, and very efficient support for VCHR files has been added.
Kernel threads are not used for VCHR files, all work for such files is
done for the requesting process directly. Some attempt has been made to
charge the requesting process for resource utilization, but more work
is needed. aio_fsync is still missing (but the original fsync system
call can be used for now.) aio_cancel is essentially a noop, but that
is okay per POSIX. More aio_cancel functionality can be added later,
if it is found to be needed.

The functions implemented include:
aio_read, aio_write, lio_listio, aio_error, aio_return,
aio_cancel, aio_suspend.

The code has been implemented to support the POSIX spec 1003.1b
(formerly known as POSIX 1003.4 spec) features of the above. The
async I/O features are truly async, with the VCHR mode of operation
being essentially the same as physio (for appropriate files) for
maximum efficiency. This code also supports the signal capability,
is highly tunable, allowing management of resource usage, and
has been written to allow a per process usage quota.

Both the O'Reilly POSIX.4 book and the actual POSIX 1003.1b document
were the reference specs used. Any filedescriptor can be used with
these new system calls. I know of no exceptions where these
system calls will not work. (TTY's will also probably work.)


# 31444 29-Nov-1997 dyson

Disable the VCHR optimization for AIO until I have implemented it. Just in
case anyone wants to play with the POSIX AIO/LIO stuff. (As it is, it should
work with ANY vnode, on UP systems only, for now.)


# 31443 29-Nov-1997 dyson

Fix and complete the AIO syscalls. There are some performance enhancements
coming up soon, but the code is functional. Docs will be forthcoming.


# 31250 18-Nov-1997 bde

Get locking stuff by #including <sys/lock.h> instead of <sys/vnode.h>.


# 31016 07-Nov-1997 phk

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# 30994 06-Nov-1997 phk

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# 30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 30309 11-Oct-1997 phk

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 30289 11-Oct-1997 dyson

Make the target for the number of AIO daemons work.


# 30240 09-Oct-1997 dyson

Major cleanup and debugging of the new AIO/LIO code. The LIO code is
now corrected. New tunables/instrumentation added. The code is now
likely "good enough to use." I will add the userland support soon.
The "high performance" mode for raw devices is still missing, and will
be added next. POSIX system calls that now appear to work:
aio_cancel, aio_error, aio_read, aio_return, aio_suspend, aio_write,
lio_listio. Missing, but to be added soon: aio_fsync.


# 29041 02-Sep-1997 bde

Removed unused #includes.


# 27465 17-Jul-1997 dyson

Clean up some lint associated with the AIO code.


# 27221 06-Jul-1997 dyson

This is an upgrade so that the kernel supports the AIO calls from
POSIX.4. Additionally, there is some initial code that supports LIO.
This code supports AIO/LIO for all types of file descriptors, with
few if any restrictions. There will be a followup very soon that
will support significantly more efficient operation for VCHR type
files (raw.) This code is also dependent on some kernel features
that don't work under SMP yet. After I commit the changes to the
kernel to support proper address space sharing on SMP, this code
will also work under SMP.


# 26670 16-Jun-1997 dyson

Add initial AIO/LIO kernel thread support files. This is preliminary, and
further features will be added.