History log of /freebsd-current/sys/kern/vfs_vnops.c
Revision Date Author Comments
# 473c90ac 10-May-2024 John Baldwin <jhb@FreeBSD.org>

uio: Use switch statements when handling UIO_READ vs UIO_WRITE

This is mostly to reduce the diff with CheriBSD which adds additional
constants to enum uio_rw, but also matches the normal style used for
uio_segflg.

Reviewed by: kib, emaste
Obtained from: CheriBSD
Differential Revision: https://reviews.freebsd.org/D45142


# 08f3d5b6 04-Apr-2024 Mark Johnston <markj@FreeBSD.org>

copy_file_range: Call vn_rdwr() at least once

This ensures that we invoke VOP_READ on the input file even if it's
empty, which in turn helps ensure that filesystems update the atime of
the file.

PR: 274615
Reviewed by: olce, rmacklem, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D43524


# 89f1dcb3 14-Mar-2024 Rick Macklem <rmacklem@FreeBSD.org>

vfs_vnops.c: Use va_bytes >= va_size hint to avoid SEEK_DATA/SEEKHOLE

vn_generic_copy_file_range() tries to maintain holes
in file ranges being copied, using SEEK_DATA/SEEK_HOLE
where possible,

Unfortunately SEEK_DATA/SEEK_HOLE operations can take
a long time under certain circumstances.
Although it is not currently possible to know if a file has
unallocated data regions, the case where va_bytes >= va_size
is a strong hint that there are no unallocated data regions.
This hint does not work well for file systems doing compression,
but since it is only a hint, it is still useful.

For the case of va_bytes >= va_size, avoid doing SEEK_DATA/SEEK_HOLE.

Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D44509


# fa26f46d 23-Feb-2024 Jason A. Harmening <jah@FreeBSD.org>

vn_lock_pair(): allow lkflags1/lkflags2 to be 0 if vp1/vp2 is NULL

It's a bit strange to require the caller to pass contrived lock flags
if the corresponding vnode is NULL, simply to appease the assertion
that exactly one of LK_SHARED or LK_EXCLUSIVE must be set. On the
other hand, we still want to catch cases in which completely bogus
or corrupt flags are passed even if the corresponding vnode is NULL.
Therefore, specifically allow empty flags for lkflags1/lkflags2 iff
the respective vp1/vp2 param is NULL.

Reviewed by: kib, olce
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D44046


# 61cc4830 18-Jan-2024 Alfredo Mazzinghi <am2419@cl.cam.ac.uk>

Abstract UIO allocation and deallocation.

Introduce the allocuio() and freeuio() functions to allocate and
deallocate struct uio. This hides the actual allocator interface, so it
is easier to modify the sub-allocation layout of struct uio and the
corresponding iovec array.

Obtained from: CheriBSD
Reviewed by: kib, markj
MFC after: 2 weeks
Sponsored by: CHaOS, EPSRC grant EP/V000292/1
Differential Revision: https://reviews.freebsd.org/D43711


# f04220c1 19-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

kcmp(2): implement for vnode files

Reviewed by: brooks, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43518


# b068bb09 07-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

Add vnode_pager_clean_{a,}sync(9)

Bump __FreeBSD_version for ZFS use.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43356


# 2319ca6a 31-Dec-2023 Rick Macklem <rmacklem@FreeBSD.org>

vfs_vnops.c: Fix vn_generic_copy_file_range() for truncation

When copy_file_range(2) was first being developed,
*inoffp + len had to be <= infile_size or an error was
returned. This semantic (as defined by Linux) changed
to allow *inoffp + len to be greater than infile_size and
the copy would end at *inoffp + infile_size.

Unfortunately, the code that decided if the outfd should
be truncated in length did not get updated for this
semantics change.
As such, if a copy_file_range(2) is done, where infile_size - *inoffp
is less that outfile_size but len is large, the outfd file is truncated
when it should not be. (The semantics for this for Linux is to not
truncate outfd in this case.)

This patch fixes the problem. I believe the calculation is safe
for all non-negative values of outsize, *outoffp, *inoffp and insize,
which should be ok, since they are all guaranteed to be non-negative.

Note that this bug is not observed over NFSv4.2, since it truncates
len to infile_size - *inoffp.

PR: 276045
Reviewed by: asomers, kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D43258


# c5405d1c 18-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_copy_file_range(): provide ENOSYS fallback to vn_generic_copy_file_range()

Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42603


# a9bc8637 18-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_copy_file_range(): find write vnodes on which to call the VOP

Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42603


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 305a2676 19-Nov-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: dodge locking for lseek(fd, 0, SEEK_CUR)

It is very common and according to dtrace while running poudriere almost
all calls with SEEK_CUR pass 0.


# 22bac49b 16-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_lock_pair(): reasonably handle vp1 == vp2 case

Lock the vnode in the most exclusive lock mode requested, once.
All callers already ensure that vp1 != vp2 or are careful enough to only
unlock once otherwise.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42642


# 23210f53 12-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_copy_file_range(): busy both in and out mp around call to VOP_COPY_FILE_RANGE()

This is required e.g. for nullfs to ensure liveness of the lower mount
points.

Reviewed by: jah, rmacklem, Olivier Certner <olce.freebsd@certner.fr>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42554


# 89188bd6 12-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_copy_file_range(): use local variables for invp/outvp vnodes v_mounts

This avoids possible NULL dereference when checking mnt_vfc names.

Reviewed by: jah, rmacklem, Olivier Certner <olce.freebsd@certner.fr>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42554


# 969071be 06-Sep-2023 Martin Matuska <mm@FreeBSD.org>

vfs: copy_file_range() between multiple mountpoints of the same fs type

VOP_COPY_FILE_RANGE(9) is now caled when source and target vnodes
reside on the same filesystem type (not just on the same mountpoint).
The check if vnodes are on the same mountpoint must be done in the
filesystem code. There are currently only three users - fusefs(5) already
has this check, ZFS can handle multiple mountpoints and a check has been
added to NFS client.

ZFS block cloning is now possible between all snapshots and datasets
of the same ZFS pool.

MFC after: 1 week
Reviewed by: rmacklem
Differential Revision: https://reviews.freebsd.org/D41721


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 821dec4d 05-Aug-2023 Konstantin Belousov <kib@FreeBSD.org>

vnode io: request range-locking when pgcache reads are enabled

PR: 272678
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41334


# 651fdc3d 05-Aug-2023 Konstantin Belousov <kib@FreeBSD.org>

Revert "vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault()"

This reverts commit 5b353925ff61b9ddb97bb453ba75278b578ed7d9.

The reason is the lesser scalability of the vnode' rangelock comparing
with the vnode lock.

Requested by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41334


# 5b353925 23-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault()

To ensure atomicity of reads against parallel writes and truncates,
vnode lock was not enough at least since introduction of vn_io_fault().
That code only take rangelock when it was possible that vn_read() and
vn_write() could drop the vnode lock.

At least since the introduction of VOP_READ_PGCACHE() which generally
does not lock the vnode at all, rangelocks become required even
for filesystems that do not need vn_io_fault() workaround. For
instance, tmpfs.

PR: 272678
Analyzed and reviewed by: Andrew Gierth <andrew@tao11.riddles.org.uk>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41158


# e38c634b 19-Jul-2023 Dmitry Chagin <dchagin@FreeBSD.org>

vfs: Add a parenthese to vn_lock_pair() asserts to silence gcc

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D41070


# f5837839 09-Jul-2023 Olivier Certner <olce.freebsd@certner.fr>

vn_lock_pair(): Support passing LK_NODDLKTREAT

Since this function ultimately calls vn_lock() or VOP_LOCK1(), allows it to
receive and pass this flag which is used in the lookup code and doesn't
interfere with the function's operation.

Reviewed by: kib, markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D40954


# 2544b8e0 28-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: Rename vfs_emptydir() to vn_dir_check_empty()

No functional change. While here, adapt comments to style(9).

Reviewed by: kib
MFC after: 1 week


# c21d87a8 28-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: vn_dir_next_dirent(): Adapt comments to style(9)

No functional change.

Reviewed by: kib
MFC after: 1 week


# 3d8450db 24-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: vn_dir_next_dirent(): Simplify interface and harden

Simplify the old interface (one less argument, simpler termination test)
and add documentation about it. Add more sanity checks (mostly under
INVARIANTS, but also in the general case to prevent infinite
loops). Drop the explicit test on minimum directory entry size (without
INVARIANTS).

Deal with the impacts in callers (dirent_exists() and vop_stdvptocnp()).
dirent_exists() has been simplified a bit, preserving the exact same
semantics but for the return code whose meaning has been reversed (0 now
means the entry exists, ENOENT that it doesn't and other values are
genuine errors). While here, suppress gratuitous casts of malloc return
values.

vn_dir_next_dirent() has been tested by a 'make -j4 buildkernel' with a
temporary modification to the VFS cache causing vn_vptocnp() to always
call VOP_VPTOCNP() and finally vop_stdvptocnp() (observed with temporary
debug counters).

Export new _GENERIC_MINDIRSIZ and _GENERIC_MAXDIRSIZ on __BSD_VISIBLE,
and GENERIC_MINDIRSIZ and GENERIC_MAXDIRSIZ on _KERNEL.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39764


# 6bce3f23 23-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: Export get_next_dirent() as vn_dir_next_dirent()

Move internal-to-'vfs_default.c' get_next_dirent() to 'vfs_vnops.c' and
export it for use by other parts of the VFS. This is a preparatory
change for using it in vfs_emptydir().

No functional change.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39755


# 37e6036b 24-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

VOP_CLOSE(): MNTK_EXTENDED_SHARED filesystems do not need excl lock

All in-tree implementations of VOP_CLOSE() for filesystems proclaiming
MNTK_EXTENDED_SHARED, are fine with the shared lock for the closed
vnode. I checked the following implementations:
ffs
ext2
ufs
null
tmpfs
devfs
fdescfs
cd9660
zfs
It seems that initial addition of FWRITE check was due to necessity of
handling the VV_TEXT vnode vflag. Since VOP_ADD_WRITECOUNT() only
requires shared lock, we can relax the locking requirement there.

Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr>
Tested by: Olivier Certner
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D39784


# 6a5e6140 24-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vn_open_vnode(): fix locking around VOP_CLOSE() on advisory lock error

In the case of a FIFO or if trying to open a file for writing, an
exclusive lock is necessary.

Reviewed by: kib
MFC after: 1 week


# afa8f897 05-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_start_write(): consistently set *mpp to NULL on error or after failed sleep

This ensures that *mpp != NULL iff vn_finished_write() should be
called, regardless of the returned error, except for V_NOWAIT.
The only exception that must be maintained is the case where
vn_start_write(V_NOWAIT) is called with the intent of later dropping
other locks and then doing vn_start_write(V_XSLEEP), which needs the mp
value calculated from the non-waitable call above it.

Also note that V_XSLEEP is not supported by vn_start_secondary_write().

Reviewed by: markj, mjg (previous version), rmacklem (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39441


# b2f32887 09-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_start_write(): minor style

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39441


# bb24eaea 05-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_lock_pair(): allow to request shared locking

If either of vnodes is shared locked, lock must not be recursed.

Requested by: rmacklem
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39444


# 3b605620 03-Feb-2023 Konstantin Belousov <kib@FreeBSD.org>

FIOSEEKHOLE/FIOSEEKDATA: correct consistency for bmap-based implementation

Writes on UFS through a mapped region do not allocate disk blocks in
holes immediately. The blocks are allocated when the pages are paged out
first time.

This breaks the algorithm in vn_bmap_seekhole() and ufs_bmap_seekdata(),
because VOP_BMAP() reports hole for the place which already contains a
valid data.

Clean the pages before doing VOP_BMAP() in the affected functions. In
principle, we could clean less by only requesting clean starting from
the offset, but it is probably not very important.

PR: 269261
Reported by: asomers
Reviewed by: asomers, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38379


# dec7db49 23-Jan-2023 Jiajie Chen <c@jia.je>

Add kf_file_nlink field to kf_file and populate it

This will allow user-space programs (e.g. lsof) to locate deleted files
whose nlink equals zero. Prior to this commit, programs has to use
stat(kf_path) to get nlink, but that will fail if the file is deleted.

[mjg: s/fail/file in the commit message]

Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D38169


# 456f0575 19-Jan-2023 Konstantin Belousov <kib@FreeBSD.org>

Handle int rank issues in in vn_getsize_locked() and vn_seek()

In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into
off_t size, we must avoid overflow.

Then, the check for fsize < 0, introduced in the commit
f45feecfb27ca51067d6789eaa43547cadc4990b 'vfs: add vn_getsize', is nop [1].

Reported and reviewed by: jhb
Coverity CID: 1502346
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38133


# f45feecf 22-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vn_getsize

getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.

As a step towards that goal make size-only consumers use a dedicated
routine.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D37885


# 4ee16246 16-Nov-2022 Rick Macklem <rmacklem@FreeBSD.org>

vfs_vnops.c: Fix blksize for ZFS

Since ZFS reports _PC_MIN_HOLE_SIZE as 512 (although it
appears that an unwritten region must be at least f_iosize
to remain unallocated), vn_generic_copy_file_range()
uses 4096 for the copy blksize for ZFS, reulting in slow copies.

For most other file systems, _PC_MIN_HOLE_SIZE and f_iosize
are the same value, so this patch modifies the code to
use f_iosize for most cases. It also documents in comments
why the blksize is being set a certain way, so that the code
does not appear to be doing "magic math".

Reported by: allanjude
Reviewed by: allanjude, asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D37076


# 33ce1788 17-Oct-2022 Konstantin Belousov <kib@FreeBSD.org>

vn_bmap_seekhole: check that passed offset is non-negative

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37024


# 52360ca3 25-Sep-2022 Alan Somers <asomers@FreeBSD.org>

copy_file_range: truncate write if it would exceed RLIMIT_FSIZE

PR: 266611
MFC after: 2 weeks
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D36706


# 1b4b7517 18-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

Add vn_rlimit_fsizex() and vn_rlimit_fsizex_res()

The vn_rlimit_fsizex() function:
- checks that the write does not exceed RLIMIT_FSIZE limit and fs
maximum supported file size
- truncates write length if it exceeds the RLIMIT_FSIZE or max file size,
but there are some bytes to write
- sends SIGXFSZ if RLIMIT_FSIZE would be exceed otherwise

POSIX mandates the truncated write in case when some bytes can be
written but whole write request fails the RLIMIT_FSIZE check.

The function is supposed to be used from VOP_WRITE()s. Due to
pecularity in the VFS generic write syscall layer, uio_resid must
correctly reflect the written amount (noted by markj). Provide the dual
vn_rlimit_fsizex_res() function to correct uio_resid after the clamp
done in vn_rlimit_fsizex() on VOP_WRITE() return.

PR: 164793
Reviewed by: asomers, jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36625


# 2ac083f6 18-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

Add vn_rlimit_trunc()

Reviewed by: asomers, jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36625


# cc65a412 18-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

filesystems: return error from vn_rlimit_fsize() instead of EFBIG

Reviewed by: asomers, jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36625


# a75d1ddd 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce V_PCATCH to stop abusing PCATCH


# 9e4f35ac 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: deperl msleep flag calculation in vn_start_*write


# a755fb92 10-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire the V_MNTREF flag

Reviewed by: kib, mckusick
Differential Revision: https://reviews.freebsd.org/D36521


# eb9a1f9c 16-Aug-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: plug dead store in vn_io_fault1

Reported by: clang --analyze


# 8c9aa94b 23-Jul-2022 Ka Ho Ng <khng@FreeBSD.org>

Convert runtime param checks to KASSERTs for fo_fspacectl

Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D35880


# b7262756 02-Apr-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fixup WANTIOCTLCAPS on open

In some cases vn_open_cred overwrites cn_flags, effectively nullifying
initialisation done in NDINIT. This will have to be fixed.

In the meantime make sure the flag is passed.

Reported by: jenkins
Noted by: Mathieu <sigsys@gmail.com>


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# d28af1ab 15-Nov-2021 Mark Johnston <markj@FreeBSD.org>

vm: Add a mode to vm_object_page_remove() which skips invalid pages

This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks. In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().

Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.

PR: 258208
Reviewed by: avg, kib, sef
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32931


# f0c9847a 06-Nov-2021 Rick Macklem <rmacklem@FreeBSD.org>

vfs: Add "ioflag" and "cred" arguments to VOP_ALLOCATE

When the NFSv4.2 server does a VOP_ALLOCATE(), it needs
the operation to be done for the RPC's credential and not
td_ucred. It also needs the writing to be done synchronously.

This patch adds "ioflag" and "cred" arguments to VOP_ALLOCATE()
and modifies vop_stdallocate() to use these arguments.

The VOP_ALLOCATE.9 man page will be patched separately.

Reviewed by: khng, kib
Differential Revision: https://reviews.freebsd.org/D32865


# 2b68eb8e 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove thread argument from VOP_STAT

and fo_stat.


# b4a58fbf 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove cn_thread

It is always curthread.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32453


# d71e1a88 25-Sep-2021 Mateusz Guzik <mjg@FreeBSD.org>

fifo: support flock

This evens it up with Linux.

Original patch by: Greg V <greg@unrelenting.technology>
Differential Revision: https://reviews.freebsd.org/D24255#565302


# 2bd98269 16-Sep-2021 Mark Johnston <markj@FreeBSD.org>

vfs: Permit unix sockets to be opened with O_PATH

As with FIFOs, a path descriptor for a unix socket cannot be used with
kevent().

In principle connectat(2) and bindat(2) could be modified to support an
AT_EMPTY_PATH-like mode which operates on the socket referenced by an
O_PATH fd referencing a unix socket. That would eliminate the path
length limit imposed by sockaddr_un.

Update O_PATH tests.

Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31970


# c5128c48 07-Sep-2021 Rick Macklem <rmacklem@FreeBSD.org>

VOP_COPY_FILE_RANGE: Add a COPY_FILE_RANGE_TIMEO1SEC flag

Although it is not specified in the RFCs, the concept that
the NFSv4 server should reply to an RPC request within a
reasonable time is accepted practice within the NFSv4 community.

Without this patch, the NFSv4.2 server attempts to reply to
a Copy operation within 1second by limiting the copy to
vfs.nfs.maxcopyrange bytes (default 10Mbytes). This is crude at
best, given the large variation in I/O subsystem performance.

This patch adds a kernel only flag COPY_FILE_RANGE_TIMEO1SEC
that the NFSv4.2 can specify, which tells VOP_COPY_FILE_RANGE()
to return after approximately 1 second with a partial result and
implements this in vn_generic_copy_file_range(), used by
vop_stdcopyfilerange().

Modifying the NFSv4.2 server to set this flag will be done in
a separate patch. Also under consideration is exposing the
COPY_FILE_RANGE_TIMEO1SEC to userland for use on the FreeBSD
copy_file_range(2) syscall.

MFC after: 2 weeks
Reviewed by: khng
Differential Revision: https://reviews.freebsd.org/D31829


# 92bb74fd 01-Sep-2021 Ka Ho Ng <khng@FreeBSD.org>

vfs: Use file_cred for VOP_DEALLOCATE in vn_deallocate if non-NULL

This changes vn_deallocate() to match the behavior of vn_rdwr() when
picking which ucred to use. That is, vn_deallocate() uses file_cred for
making VOP call if it is non-NULL, or use active_cred otherwise.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31712


# a58e222b 29-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

vfs: yield in vn_deallocate_impl() loop

Yield at the end of each loop iteration if there are remaining works as
indicated by the value of *len updated by VOP_DEALLOCATE. Without this,
when calling vop_stddeallocate to zero a large region, the
implementation only zerofills a relatively small chunk and returns.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31705


# a638dc4e 12-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

vfs: Add ioflag to VOP_DEALLOCATE(9)

The addition of ioflag allows callers passing
IO_SYNC/IO_DATASYNC/IO_DIRECT down to the file system implementation.
The vop_stddeallocate fallback implementation is updated to pass the
ioflag to the file system implementation. vn_deallocate(9) internally is
also changed to pass ioflag to the VOP_DEALLOCATE call.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31500


# c15384f8 12-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

vfs: Add get_write_ioflag helper to calculate ioflag

Converted vn_write to use this helper.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31513


# 4a9b832a 11-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

vfs: Rename ioflg to ioflag in vn_deallocate

This includes a style fix around ioflag checking as well.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib, bcr
Differential Revision: https://reviews.freebsd.org/D31505


# c18c74a8 06-Aug-2021 Rick Macklem <rmacklem@FreeBSD.org>

namei: Add cn_flags bits for OPENREAD and OPENWRITE

VOP_LOOKUP() is called with cn_flags bits ISLASTCN and ISOPEN
to indicate that the lookup is for the last component of a pathname
when doing open.

If the cn_flags also indicates if the open is for Reading, Writing or Both,
the NFSv4 client can do an NFSv4 Open operation in the same compound
RPC as Lookup, often avoiding the additional Open RPC now done when
VOP_OPEN() is called.

This patch defines two new cn_flags bits called OPENREAD and OPENWRITE
and sets these in open2nameif() based on FREAD, FWRITE flag bits.
This will allow a subsequent patch to the NFSv4 client to do the Open
operation in the same RPC as Lookup.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31431


# 0dc332bf 05-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347


# abbb57d5 04-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

vfs: Introduce vn_bmap_seekhole_locked()

vn_bmap_seekhole_locked() is factored out version of vn_bmap_seekhole().
This variant requires shared vnode lock being held around the call.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31404


# 0ef5eee9 03-Aug-2021 Konstantin Belousov <kib@FreeBSD.org>

Add vn_lktype_write()

and remove repetetive code that calculates vnode locking type for write.

Reviewed by: khng, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31405


# a19ae1b0 02-Jun-2021 Rich Ercolani <rincebrain@gmail.com>

vfs: fix MNT_SYNCHRONOUS check in vn_write

ca1ce50b2b5ef11d ("vfs: add more safety against concurrent forced
unmount to vn_write") has a side effect of only checking MNT_SYNCHRONOUS
if O_FSYNC is set.

Reviewed By: mjg
Differential Revision: https://reviews.freebsd.org/D30610


# 478c52f1 29-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: slightly rework vn_rlimit_fsize


# e71d5c73 22-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

Fix limit testing after 1762f674ccb571e6 ktrace commit.

The previous:

if ((uoff_t)uio->uio_offset + uio->uio_resid > lim)
signal(....);

was replaced with:

if ((uoff_t)uio->uio_offset + uio->uio_resid < lim)
return;
signal(....);

Making (uoff_t)uio->uio_offset + uio->uio_resid == lim trip over the
limit, when it did not previously.

Unbreaks running 13.0 buildworld.


# 48235c37 22-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

Fix a braino in previous.

Instead of trying to partially ifdef out ktrace handling, define the
missing identifier to 0. Without this fix lack of ktrace in the kernel
also means there is no SIGXFSZ signal delivery.


# 154f0ecc 22-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

Fix tinderbox build after 1762f674ccb571e6 ktrace commit.


# ea2b64c2 18-May-2021 Konstantin Belousov <kib@FreeBSD.org>

ktrace: add a kern.ktrace.filesize_limit_signal knob

When enabled, writes to ktrace.out that exceed the max file size limit
cause SIGXFSZ as it should be, but note that the limit is taken from
the process that initiated ktrace. When disabled, write is blocked,
but signal is not send.

Note that in either case ktrace for the affected process is stopped.

Requested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30257


# 02645b88 14-May-2021 Konstantin Belousov <kib@FreeBSD.org>

ktrace: use the limit of the trace initiator for file size limit on writes

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30257


# 9bb84c23 13-May-2021 Konstantin Belousov <kib@FreeBSD.org>

accounting: explicitly mark the exiting thread as doing accounting

and use the mark to stop applying file size limits on the write of
the accounting record. This allows to remove hack to clear process
limits in acct_process(), and avoids the bug with the clearing being
ineffective because limits are also cached in the thread structure.

Reported and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30257


# ca1ce50b 14-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add more safety against concurrent forced unmount to vn_write

1. stop re-reading ->v_mount (can become NULL)
2. stop re-reading ->v_type (can change to VBAD)


# 6de3cf14 12-May-2021 Konstantin Belousov <kib@FreeBSD.org>

vn_open_cred(): disallow O_CREAT | O_EMPTY_PATH

This combination does not make sense, and cannot be satisfied by lookup.
In particular, lookup cannot supply dvp, it only can directly return vp.

Reported and reviewed by: markj using syzkaller
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 5e7cdf18 06-May-2021 Konstantin Belousov <kib@FreeBSD.org>

openat(2): add O_EMPTY_PATH

It reopens the passed file descriptor, checking the file backing vnode'
current access rights against open mode. In particular, this flag allows
to convert file descriptor opened with O_PATH, into operable file
descriptor, assuming permissions allow that.

Reviewed by: markj
Tested by: Andrew Walker <awalker@ixsystems.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30148


# 4f592683 02-May-2021 Rick Macklem <rmacklem@FreeBSD.org>

copy_file_range(2): improve copying of a large hole to EOF

PR#255523 reported that a file copy for a file with a large hole
to EOF on ZFS ran slowly over NFSv4.2.
The problem was that vn_generic_copy_file_range() would
loop around reading the hole's data and then see it is all
0s. It was coded this way since UFS always allocates a data
block near the end of the file, such that a hole to EOF never exists.

This patch modifies vn_generic_copy_file_range() to check for a
ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that
case as a hole to EOF. asomers@ confirms that it works for his
ZFS test case.

PR: 255523
Tested by: asomers
Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30076


# 20825657 28-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

O_PATH: disable kqfilter for fifos

Filter on fifos is real filter for the object, and not a filesystem
events filter like EVFILT_VNODE.

Reported by: markj using syzkaller
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 54f98c4d 19-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

vn_open_vnode(): handle error when fp == NULL

If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to
be called, we cannot use fp fo_close() when there is no fp. This occurs
when e.g. kernel code directly calls vn_open() instead of the open(2)
syscall.

In this case, VOP_CLOSE() can be called directly, after possible lock
upgrade.

Reported by: nvass@gmx.com
PR: 255119
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29830


# bbf7a4e8 07-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

O_PATH: allow vnode kevent filter on such files

if VREAD access is checked as allowed during open

Requested by: wulf
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323


# f9b923af 06-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

O_PATH: Allow to open symlink

When O_NOFOLLOW is specified, namei() returns the symlink itself. In
this case, open(O_PATH) should be allowed, to denote the location of symlink
itself.

Prevent O_EXEC in this case, execve(2) code is not ready to try to execute
symlinks.

Reported by: wulf
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323


# 8d9ed174 17-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

open(2): Implement O_PATH

Reviewed by: markj
Tested by: pho
Discussed with: walker.aj325_gmail.com, wulf
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323


# 437c241d 17-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

vfs_vnops.c: Make vn_statfile() non-static

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323


# 42be0a7b 17-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Style.

Add missed spaces, wrap long lines.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323


# 20e91ca3 15-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

open(2): Remove O_BENEATH and AT_BENEATH

with the reasoning that the flags did not worked properly, and were not
shipped in a release.

O_RESOLVE_BENEATH is kept as useful.

Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907


# a5f9fe2b 01-Mar-2021 Rick Macklem <rmacklem@FreeBSD.org>

copy_file_range(2): Fix for small values of input file offset and len

r366302 broke copy_file_range(2) for small values of
input file offset and len.

It was possible for rem to be greater than len and then
"len - rem" was a large value, since both variables are
unsigned.

Reported by: koobs, Pablo <pablogsal gmail com> (Python)
Reviewed by: asomers, koobs
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28981


# fa3bd463 29-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

lockf: ensure atomicity of lockf for open(O_CREAT|O_EXCL|O_EXLOCK)

or EX_SHLOCK. Do it by setting a vnode iflag indicating that the locking
exclusive open is in progress, and not allowing F_LOCK request to make
a progress until the first open finishes.

Requested by: mckusick
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28697


# c61fae14 14-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

pgcache read: protect against reads past end of the vm object size

If uio_offset is past end of the object size, calculated resid is negative.
Delegate handling this case to the locked read, as any other non-trivial
situation.

PR: 253158
Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Tested by: cy
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3b2aa360 28-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

Use VOP_VPUT_PAIR() for eligible VFS syscalls.

The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().

Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# ee965dfa 03-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

vn_open(): If the vnode is reclaimed during open(2), do not return error.

Most future operations on the returned file descriptor will fail
anyway, and application should be ready to handle that failures. Not
forcing it to understand the transient failure mode on open, which is
implementation-specific, should make us less special without loss of
reporting of errors.

Suggested by: chs
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# ef23df13 13-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: set NC_KEEPPOSENTRY alongside NOCACHE when creating a file

Arguably the entire NOCACHE logic should get retired, in the meantime
at least prevent the code from evicting existing entries.


# a5e28403 07-Jan-2021 Thomas Munro <tmunro@FreeBSD.org>

open(2): Add O_DSYNC flag.

POSIX O_DSYNC means that writes include an implicit fdatasync(2), just
as O_SYNC implies fsync(2).

VOP_WRITE() functions that understand the new IO_DATASYNC flag can act
accordingly, but we'll still pass down IO_SYNC so that file systems that
don't understand it will continue to provide the stronger O_SYNC
behaviour.

Flag also applies to fcntl(2).

Reviewed by: kib, delphij
Differential Revision: https://reviews.freebsd.org/D25090


# 11403bde 06-Jan-2021 Chuck Silvers <chs@FreeBSD.org>

vfs: fix rangelock range in vn_rdwr() for IO_APPEND

vn_rdwr() must lock the entire file range for IO_APPEND
just like vn_io_fault() does for O_APPEND.

Reviewed by: kib, imp, mckusick
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28008


# 3e506a67 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add v_irflag accessors

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27793


# 99c66d3a 26-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

vn_read_from_obj(): fix handling of doomed vnodes.

There is no reason why vp->v_object cannot be NULL. If it is, it's
fine, handle it by delegating to VOP_READ().

Tested by: pho
Reviewed by: markj, mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27327


# 3b1f974b 26-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make max ticks for pause in vn_lock_pair() adjustable at runtime.

Reduce default value from hz / 10 to hz / 100.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation


# e75f0f2b 20-Nov-2020 Kirk McKusick <mckusick@FreeBSD.org>

Only attempt a VOP_UNLOCK() when the vn_lock() has been successful.

No MFC as this code is not present in 12-stable.

Reported by: Peter Holm
Reviewed by: Mateusz Guzik
Tested by: Peter Holm
Sponsored by: Netflix


# 7cde2ec4 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Implement vn_lock_pair().

In collaboration with: pho
Reviewed by: mckusick (previous version), markj (previous version)
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# f6dd1aef 09-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: group mount per-cpu vars into one struct

While here move frequently read stuff into the same cacheline.

This shrinks struct mount by 64 bytes.

Tested by: pho


# eebc2e45 28-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add NDREINIT to facilitate repeated namei calls

struct nameidata mixes caller arguments, internal state and output, which
can be quite error prone.

Recent addition of valdiating ni_resflags uncovered a caller which could
repeatedly call namei, effectively operating on partially populated state.

Add bare minimium validation this does not happen. The real fix would
decouple aforementioned state.

Reported by: pho
Tested by: pho (different variant)


# deb1339f 09-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix a panic when truncating comming from copy_file_range

Truncating requires an exclusive lock, but it was not taken if the
filesystem indicates support for shared writes. This only concerns
ZFS.

In particular fixes cp of files which have trailing holes.

Reported by: bdrewery


# 19fe23fa 08-Oct-2020 Rick Macklem <rmacklem@FreeBSD.org>

Make vn_generic_copy_file_range() interruptible via a signal.

Without this patch, when vn_generic_copy_file_range() is
doing a large copy, it will remain in the function for a
considerable amount of time, delaying handling of any
outstanding signals until the copy completes.

This patch adds checks for signals that need to be
processed after each successful data copy cycle.
When sig_intr() returns non-zero, vn_generic_copy_file_range()
will return.
The check "if (len < savlen)" ensures that some data
has been copied, so that progress will be made.

Note that, since copy_file_range(2) is allowed to
return fewer bytes copied than requested, it
will never return EINTR/ERESTART when sig_intr()
returns non-zero.

Reviewed by: kib, asomers
Differential Revision: https://reviews.freebsd.org/D26620


# 961afe3c 30-Sep-2020 Rick Macklem <rmacklem@FreeBSD.org>

Clip the "len" argument to vn_generic_copy_file_range() at a
hole size boundary.

By clipping the len argument of vn_generic_copy_file_range() to end at
an exact multiple of hole size, holes are more likely to be maintained
during the copy.
A hole can still straddle the boundary at the end of the
copy range, resulting in a block being allocated in the
output file as it is being grown in size, but this will reduce the
likelyhood of this happening.

While here, also modify setting of blksize to better handle the
case where _PC_MIN_HOLE_SIZE is returned as 1.

Reviewed by: asomers
Differential Revision: https://reviews.freebsd.org/D26570


# 164aa1e9 29-Sep-2020 Rick Macklem <rmacklem@FreeBSD.org>

Make copy_file_range(2) Linux compatible for overflow of offset + len.

Without this patch, if a call to copy_file_range(2) specifies an input file
offset + len that would wrap around, EINVAL is returned.
I thought that was the Linux behaviour, but recent testing showed that
Linux accepts this case and does the copy_file_range() to EOF.

This patch changes the FreeBSD code to exhibit the same behaviour as
Linux for this case.

Reviewed by: asomers, kib
Differential Revision: https://reviews.freebsd.org/D26569


# 1317da43 22-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add O_RESOLVE_BENEATH and AT_RESOLVE_BENEATH to mimic Linux' RESOLVE_BENEATH.

It is like O_BENEATH, but disables to walk out of the subtree rooted
in the starting directory. O_BENEATH does not care if path walks out
if it returned.

Requested by: Dan Gohman <dev@sunfishcode.online>
PR: 248335
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886


# 4a0b316d 22-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add open2nameif()

the helper to calculate namei flags both for open(2) and creat(2).

Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886


# f9cc8410 18-Sep-2020 Eric van Gyzen <vangyzen@FreeBSD.org>

vm_ooffset_t is now unsigned

vm_ooffset_t is now unsigned. Remove some tests for negative values,
or make other adjustments accordingly.

Reported by: Coverity
Reviewed by: kib markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D26214


# 3c484f32 15-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Convert page cache read to VOP.

There are several negative side-effects of not calling into VOP layer
at all for page cache reads. The biggest is the missed activation of
EVFILT_READ knotes.

Also, it allows filesystem to make more fine grained decision to
refuse read from page cache.

Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for
asserts.

Reviewed by: markj
Tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346


# 88863665 15-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

vfs_subr.c: export io_hold_cnt and vn_read_from_obj().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346


# feabaaf9 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the always curthread argument from reverse lookup routines

Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by: pho


# beb27033 16-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

Fix powerpc build.

Sponsored by: The FreeBSD Foundation


# fbca789f 16-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

VMIO read

If possible, i.e. if the requested range is resident valid in the vm
object queue, and some secondary conditions hold, copy data for
read(2) directly from the valid cached pages, avoiding vnode lock and
instantiating buffers. I intentionally do not start read-ahead, nor
handle the advises on the cached range.

Filesystems indicate support for VMIO reads by setting VIRF_PGREAD
flag, which must not be cleared until vnode reclamation.

Currently only filesystems that use vnode pager for v_objects can
enable it, due to reliance on vnp_size. There is a WIP to handle it
for tmpfs.

Reviewed by: markj
Discussed with: jeff
Tested by: pho
Benchmarked by: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25968


# 51ea7bea 07-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add VOP_STAT

The current scheme of calling VOP_GETATTR adds avoidable overhead.

An example with tmpfs doing fstat (ops/s):
before: 7488958
after: 7913833

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D25910


# 3ea3fbe6 16-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix vn_poll performance with either MAC or AUDIT

The code would unconditionally lock the vnode to audit or call the
mac hoook, even if neither want to do anything. Pre-check the state
to avoid locking in the common case of nothing to do.

Note this code should not be normally executed anyway as vnodes are
always return ready. However, poll1/2 from will-it-scale use regular
files for benchmarking, presumably to focus on the interface itself
as the vnode handler is not supposed to do almost anything.

This in particular fixes poll2 which passes 128 fds.

$ ./poll2_processes -s 10
before: 134411
after: 271572


# ab06a305 16-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix MAC/AUDIT mismatch in vn_poll

Auditing would not be performed without MAC compiled in.


# 422f38d8 10-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix trivial whitespace issues which don't interefere with blame

.. even without the -w switch


# 4543c1c3 05-Jul-2020 Konstantin Belousov <kib@FreeBSD.org>

Fix typo.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# f2706588 21-Jun-2020 Thomas Munro <tmunro@FreeBSD.org>

vfs: track sequential reads and writes separately

For software like PostgreSQL and SQLite that sometimes reads sequentially
while also writing sequentially some distance behind with interleaved
syscalls on the same fd, performance is better on UFS if we do
sequential access heuristics separately for reads and writes.

Patch originally by Andrew Gierth in 2008, updated and proposed by me with
his permission.

Reviewed by: mjg, kib, tmunro
Approved by: mjg (mentor)
Obtained from: Andrew Gierth <andrew@tao11.riddles.org.uk>
Differential Revision: https://reviews.freebsd.org/D25024


# 63619b6d 04-Jun-2020 Kyle Evans <kevans@FreeBSD.org>

vfs: add restrictions to read(2) of a directory [2/2]

This commit adds the priv(9) that waters down the sysctl to make it only
allow read(2) of a dirfd by the system root. Jailed root is not allowed, but
jail policy and superuser policy will abstain from allowing/denying it so
that a MAC module can fully control the policy.

Such a MAC module has been written, and can be found at:
https://people.freebsd.org/~kevans/mac_read_dir-0.1.0.tar.gz

It is expected that the MAC module won't be needed by many, as most only
need to do such diagnostics that require this behavior as system root
anyways. Interested parties are welcome to grab the MAC module above and
create a port or locally integrate it, and with enough support it could see
introduction to base. As noted in mac_read_dir.c, it is released under the
BSD 2 clause license and allows the restrictions to be lifted for only
jailed root or for all unprivileged users.

PR: 246412
Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous)
Reviewed by: rgrimes (latest version)
Differential Revision: https://reviews.freebsd.org/D24596


# dcef4f65 04-Jun-2020 Kyle Evans <kevans@FreeBSD.org>

vfs: add restrictions to read(2) of a directory [1/2]

Historically, we've allowed read() of a directory and some filesystems will
accommodate (e.g. ufs/ffs, msdosfs). From the history department staffed by
Warner: <<EOF

pdp-7 unix seemed to allow reading directories, but they were weird, special
things there so I'm unsure (my pdp-7 assembler sucks).

1st Edition's sources are lost, mostly. The kernel allows it. The
reconstructed sources from 2nd or 3rd edition read it though.

V6 to V7 changed the filesystem format, and should have been a warning, but
reading directories weren't materially changed.

4.1b BSD introduced readdir because of UFS. UFS broke all directory reading
programs in 1983. ls, du, find, etc all had to be rewritten. readdir() and
friends were introduced here.

SysVr3 picked up readdir() in 1987 for the AT&T fork of Unix. SysVr4 updated
all the directory reading programs in 1988 because different filesystem
types were introduced.

In the 90s, these interfaces became completely ubiquitous as PDP-11s running
V7 faded from view and all the folks that initially started on V7 upgraded
to SysV. Linux never supported this (though I've not done the software
archeology to check) because it has always had a pathological diversity of
filesystems.
EOF

Disallowing read(2) on a directory has the side-effect of masking
application bugs from relying on other implementation's behavior
(e.g. Linux) of rejecting these with EISDIR across the board, but allowing
it has been a vector for at least one stack disclosure bug in the past[0].

By POSIX, this is implementation-defined whether read() handles directories
or not. Popular implementations have chosen to reject them, and this seems
sensible: the data you're reading from a directory is not structured in some
unified way across filesystem implementations like with readdir(2), so it is
impossible for applications to portably rely on this.

With this patch, we will reject most read(2) of a dirfd with EISDIR. Users
that know what they're doing can conscientiously set
bsd.security.allow_read_dir=1 to allow read(2) of directories, as it has
proven useful for debugging or recovery. A future commit will further limit
the sysctl to allow only the system root to read(2) directories, to make it
at least relatively safe to leave on for longer periods of time.

While we're adding logic pertaining to directory vnodes to vn_io_fault, an
additional assertion has also been added to ensure that we're not reaching
vn_io_fault with any write request on a directory vnode. Such request would
be a logical error in the kernel, and must be debugged rather than allowing
it to potentially silently error out.

Commented out shell aliases have been placed in root's chsrc/shrc to promote
awareness that grep may become noisy after this change, depending on your
usage.

A tentative MFC plan has been put together to try and make it as trivial as
possible to identify issues and collect reports; note that this will be
strongly re-evaluated. Tentatively, I will MFC this knob with the default as
it is in HEAD to improve our odds of actually getting reports. The future
priv(9) to further restrict the sysctl WILL NOT BE MERGED BACK, so the knob
will be a faithful reversion on stable/12. We will go into the merge
acknowledging that the sysctl default may be flipped back to restore
historical behavior at *any* point if it's warranted.

[0] https://www.freebsd.org/security/advisories/FreeBSD-SA-19:10.ufs.asc

PR: 246412
Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous)
Reviewed by: rgrimes (latest version)
MFC after: 1 month (note the MFC plan mentioned above)
Relnotes: absolutely, but will amend previous RELNOTES entry
Differential Revision: https://reviews.freebsd.org/D24596


# e3d16bb6 24-May-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use atomic_{store,load}_long to manage f_offset

... instead of depending on the compiler not to mess them up


# 442e617f 24-May-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: restore mtx-protected foffset locking for 32 bit platforms

They depend on it to accurately read the offset.

The new code is not used as it would add an interrupt enable/disable
trip on top of the atomic.

This also fixes a bug where 32-bit nolock request would still lock the offset.

No changes for 64-bit.

Reported by: emaste


# 3fc40153 23-May-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: scale foffset_lock by using atomics instead of serializing on mtx pool

Contending cases still serialize on sleepq (which would be taken anyway).

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21626


# 90f29198 09-Apr-2020 Konstantin Belousov <kib@FreeBSD.org>

Remove extra call to vfs_op_exit() from vfs_write_suspend() when VFS_SYNC() fails.

The vfs_write_resume() handler already does vfs_op_exit() for us.

Reported by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation


# 2782c00c 22-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

vfs: quiet -Wwrite-strings

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23797


# 074ad60a 15-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: make write suspension mandatory

At the time opt-in was introduced adding yourself as a writer was esrializing
across the mount point. Nowadays it is fully per-cpu, the only impact being
a small single-threaded hit on top of what's there right now.

Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which
has is done regardless.

Should someone want to microoptimize this single-threaded they can coalesce
looking the mount up with adding a write to it.


# 7b2ff0dc 13-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

Partially decompose priv_check by adding priv_check_cred_vfs_generation

During buildkernel there are very frequent calls to priv_check and they
all are for PRIV_VFS_GENERATION (coming from stat/fstat).

This results in branching on several potential privileges checking if
perhaps that's the one which has to be evaluated.

Instead of the kitchen-sink approach provide a way to have commonly used
privs directly evaluated.


# 2f7f11b7 08-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: tidy up vget_finish and vn_lock

- remove assertion which duplicates vn_lock
- use VNPASS instead of retyping the failure
- report what flags were passed if panicking on them


# 3ff65f71 30-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Remove duplicated empty lines from kern/*.c

No functional changes.


# 2856d85e 08-Jan-2020 Kyle Evans <kevans@FreeBSD.org>

posix_fallocate: push vnop implementation into the fileop layer

This opens the door for other descriptor types to implement
posix_fallocate(2) as needed.

Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D23042


# 7e2ea577 04-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: factor out avoidable branches in _vn_lock


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# fdc6b10d 29-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Add a VN_OPEN_INVFS flag.

vn_open_cred() assumes that it is called from the top-level of a VFS
syscall. Writers must call bwillwrite() before locking any VFS
resource to wait for cleanup of dirty buffers.

ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which
results in wait for unrelated buffers while owning ZFS vnode lock (and
ZFS does not use buffer cache). VN_OPEN_INVFS allows caller to skip
bwillwrite.

Note that ZFS is still incorrect there, because it starts write on an
mp and locks a vnode while holding another vnode lock.

Reported by: Willem Jan Withagen <wjw@digiware.nl>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 48e48578 09-Nov-2019 Rick Macklem <rmacklem@FreeBSD.org>

Update copy_file_range(2) to be Linux5 compatible.

The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The first of these semantic changes was changed to be compatible with
linux5.n by r354564.
For the second semantic change, the old linux man page stated that, if
infd and outfd referred to the same file, EBADF should be returned.
Now, the semantics is to allow infd and outfd to refer to the same file
so long as the byte ranges defined by the input file offset, output file offset
and len does not overlap. If the byte ranges do overlap, EINVAL should be
returned.
This patch modifies copy_file_range(2) to be linux5.n compatible for this
semantic change.


# 15930ae1 08-Nov-2019 Rick Macklem <rmacklem@FreeBSD.org>

Update copy_file_range(2) to be Linux5 compatible.

The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The old linux man page stated that, if the
offset + len exceeded file_size for the input file, EINVAL should be returned.
Now, the semantics is to copy up to at most file_size bytes and return that
number of bytes copied. If the offset is at or beyond file_size, a return
of 0 bytes is done.
This patch modifies copy_file_range(2) to be linux compatible for this
semantic change.
A separate patch will change copy_file_range(2) for the other semantic
change, which allows the infd and outfd to refer to the same file, so
long as the byte ranges do not overlap.


# 9bb37c03 16-Oct-2019 Andrew Turner <andrew@FreeBSD.org>

Stop leaking information from the kernel through timespec

The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.

Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.

In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.

This doesn't affect Tier-1 architectures so no SA is expected.

admbugs: 651
MFC after: 1 week
Sponsored by: DARPA, AFRL


# 55894117 17-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY.

Reviewed by: bcr (man page), emaste (previous version)
PR: 240452
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
DIfferential revision: https://reviews.freebsd.org/D21634


# 4cace859 16-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: convert struct mount counters to per-cpu

There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before: 852393 ops/s
after: 76682077 ops/s

Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21637


# e87f3f72 16-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: manage mnt_writeopcount with atomics

See r352424.

Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21575


# fe7bcbaf 03-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

vm pager: writemapping accounting for OBJT_SWAP

Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.

Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.

The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21456


# 95acb40c 27-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

vn_vget_ino_gen(): relock the lower vnode on error.

The function' interface assumes that the lower vnode is passed and
returned locked always.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# df9bc7df 21-Aug-2019 Rick Macklem <rmacklem@FreeBSD.org>

Map ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE).

Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE)
on a file in a file system that does not have its own VOP_IOCTL(), the
lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since
ENOTTY is not listed as an error return by either the lseek(2) man page
nor the POSIX draft for lseek(2).
This was discussed on freebsd-current@ here:
http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA

This trivial patch maps ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE).

Reviewed by: markj
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D21300


# c61b1431 15-Aug-2019 Rick Macklem <rmacklem@FreeBSD.org>

Fix copy_file_range(2) so that unneeded blocks are not allocated to the output file.

When the byte range for copy_file_range(2) doesn't go to EOF on the
output file and there is a hole in the input file, a hole must be
"punched" in the output file. This is done by writing a block of bytes
all set to 0.
Without this patch, the write is done unconditionally which means that,
if the output file already has a hole in that byte range, a unneeded data block
of all 0 bytes would be allocated.
This patch adds code to check for a hole in the output file, so that it can
skip doing the write if there is already a hole in that byte range of
the output file. This avoids unnecessary allocation of blocks to the
output file.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D21155


# 6b1bc6f7 08-Aug-2019 Rick Macklem <rmacklem@FreeBSD.org>

Remove some harmless cruft from vn_generic_copy_file_range().

An earlier version of the patch had code that set "error" between
line#s 2797-2799. When that code was moved, the second check for "error != 0"
could never be true and the check became harmless cruft.
This patch removes the cruft, mainly to make Coverity happy.

Reported by: asomers, cem


# 61463314 08-Aug-2019 Rick Macklem <rmacklem@FreeBSD.org>

Fix copy_file_range(2) for an unlikely race during hole finding.

Since the VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE) calls are done with the
vnode unlocked, it is possible for another thread to do:
- truncate(), lseek(), write()
between the two calls and create a hole where FIOSEEKDATA returned the
start of data.
For this case, VOP_IOCTL(FIOSEEKHOLE) will return the same offset for
the hole location. This could result in an infinite loop in the copy
code, since copylen is set to 0 and the copy doesn't advance.
Usually, this race is avoided because of the use of rangelocks, but the
NFS server does not do range locking and could do a sequence like the
above to create the hole.

This patch checks for this case and makes the hole search fail, to avoid
the infinite loop.

At this time, it is an open question as to whether or not the NFS server
should do range locking to avoid this race.


# bbbbeca3 24-Jul-2019 Rick Macklem <rmacklem@FreeBSD.org>

Add kernel support for a Linux compatible copy_file_range(2) syscall.

This patch adds support to the kernel for a Linux compatible
copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9).
This syscall/VOP can be used by the NFSv4.2 client to implement the
Copy operation against an NFSv4.2 server to do file copies locally on
the server.
The vn_generic_copy_file_range() function in this patch can be used
by the NFSv4.2 server to implement the Copy operation.
Fuse may also me able to use the VOP_COPY_FILE_RANGE() method.

vn_generic_copy_file_range() attempts to maintain holes in the output
file in the range to be copied, but may fail to do so if the input and
output files are on different file systems with different _PC_MIN_HOLE_SIZE
values.

Separate commits will be done for the generated syscall files and userland
changes. A commit for a compat32 syscall will be done later.

Reviewed by: kib, asomers (plus comments by brooks, jilles)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20584


# 0122532e 17-Jul-2019 Alan Somers <asomers@FreeBSD.org>

F_READAHEAD: Fix r349248's overflow protection, broken by r349391

I accidentally broke the main point of r349248 when making stylistic changes
in r349391. Restore the original behavior, and also fix an additional
overflow that was possible when uio->uio_resid was nearly SSIZE_MAX.

Reported by: cem
Reviewed by: bde
MFC after: 2 weeks
MFC-With: 349248
Sponsored by: The FreeBSD Foundation


# 555d8f28 01-Jul-2019 Rick Macklem <rmacklem@FreeBSD.org>

Factor out the code that does a VOP_SETATTR(size) from vn_truncate().

This patch factors the code in vn_truncate() that does the actual
VOP_SETATTR() of size into a separate function called vn_truncate_locked().
This will allow the NFS server and the patch that adds a
copy_file_range(2) syscall to call this function instead of duplicating
the code and carrying over changes, such as the recent r347151.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D20808


# 0cfc1ef3 27-Jun-2019 Alan Somers <asomers@FreeBSD.org>

FIOBMAP2: inline vn_ioc_bmap2

Reported by: kib
Reviewed by: kib
MFC after: 2 weeks
MFC-With: 349238
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20783


# 4f53d57e 25-Jun-2019 Alan Somers <asomers@FreeBSD.org>

fcntl: style changes to r349248

Reported by: bde
MFC after: 2 weeks
MFC-With: 349248
Sponsored by: The FreeBSD Foundation


# 38b06f8a 20-Jun-2019 Alan Somers <asomers@FreeBSD.org>

fcntl: fix overflow when setting F_READAHEAD

VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.

Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936. Formalize that by using a union type.

Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20710


# d49b446b 20-Jun-2019 Alan Somers <asomers@FreeBSD.org>

Add FIOBMAP2 ioctl

This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux. FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.

Reviewed by: mckusick
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20705


# daec9284 21-May-2019 Conrad Meyer <cem@FreeBSD.org>

Include ktr.h in more compilation units

Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon


# 78022527 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923


# ae909414 09-Apr-2019 Konstantin Belousov <kib@FreeBSD.org>

Add vn_fsync_buf().

Provide a convenience function to avoid the hack with filling fake
struct vop_fsync_args and then calling vop_stdfsync().

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4ae3f5a7 05-Apr-2019 Konstantin Belousov <kib@FreeBSD.org>

vn_vmap_seekhole(): align running offset to the block boundary.

Otherwise we might miss the last iteration where EOF appears below
unaligned noff.

Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19811


# 4f77f488 25-Oct-2018 Konstantin Belousov <kib@FreeBSD.org>

Implement O_BENEATH and AT_BENEATH.

Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory. Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by: jilles (code, previous version of manpages), 0mp (manpages)
Discussed with: allanjude, emaste, jonathan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17547


# c9e562b1 11-Sep-2018 Gordon Tetlow <gordon@FreeBSD.org>

Correct ELF header parsing code to prevent invalid ELF sections from
disclosing memory.

Submitted by: markj
Reported by: Thomas Barabosch, Fraunhofer FKIE
Approved by: re (implicit)
Approved by: so
Security: FreeBSD-SA-18:12.elf
Security: CVE-2018-6924
Sponsored by: The FreeBSD Foundation


# b8d908b7 01-Jun-2018 Ed Maste <emaste@FreeBSD.org>

ANSIfy sys/kern


# b99aa0fb 29-May-2018 Matt Macy <mmacy@FreeBSD.org>

hwpmc: don't enter epoch section across mmap hook


# 161bf65f 24-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

In vn_io_fault1(), reduce the scope where pagefaults are disabled.

Most important for the future use, do not call
vm_fault_quick_hold_pages() with disabled pagefaults.

Reported and tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 03311f11 27-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Use whole mnt_stat.f_fsid bits for st_dev.

Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method or reporting st_dev by stat(2).

Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.

Note that the change is mostly cosmetic. Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit..

Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by: The FreeBSD Foundation


# 69921123 23-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439


# 3e85b721 16-May-2017 Ed Maste <emaste@FreeBSD.org>

Remove register keyword from sys/ and ANSIfy prototypes

A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193


# ecc6c515 19-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Apply noexec mount option for mmap(PROT_EXEC).

Right now the noexec mount option disallows image activators to try
execve the files on the mount point. Also, after r127187, noexec
also limits max_prot map entries permissions for mappings of files
from such mounts, but not the actual mapping permissions.

As result, the API behaviour is inconsistent. The files from noexec
mount can be mapped with PROT_EXEC, but if mprotect(2) drops execution
permission, it cannot be re-enabled later. Make this consistent
logically and aligned with behaviour of other systems, by disallowing
PROT_EXEC for mmap(2).

Note that this change only ensures aligned results from mmap(2) and
mprotect(2), it does not prevent actual code execution from files
coming from noexec mount. Such files can always be read into
anonymous executable memory and executed from there.

Reported by: shamaz.mazum@gmail.com
PR: 217062
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 987ff181 12-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Consistently handle negative or wrapping offsets in the mmap(2) syscalls.

For regular files and posix shared memory, POSIX requires that
[offset, offset + size) range is legitimate. At the maping time,
check that offset is not negative. Allowing negative offsets might
expose the data that filesystem put into vm_object for internal use,
esp. due to OFF_TO_IDX() signess treatment. Fault handler verifies
that the mapped range is valid, assuming that mmap(2) checked that
arithmetic gives no undefined results.

For device mappings, leave the semantic of negative offsets to the
driver. Correct object page index calculation to not erronously
propagate sign.

In either case, disallow overflow of offset + size.

Update mmap(2) man page to explain the requirement of the range
validity, and behaviour when the range becomes invalid after mapping.

Reported and tested by: royger (previous version)
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# e83a71c6 10-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Fix r313495.

The file type DTYPE_VNODE can be assigned as a fallback if VOP_OPEN()
did not initialized file type. This is a typical code path used by
normal file systems.

Also, change error returned for inappropriate file type used for
O_EXLOCK to EOPNOTSUPP, as declared in the open(2) man page.

Reported by: cy, dhw, Iblis Lin <iblis@hs.ntnu.edu.tw>
Tested by: dhw
Sponsored by: The FreeBSD Foundation
MFC after: 13 days


# e628e1b9 09-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Increase a chance of devfs_close() calling d_close cdevsw method.

If a file opened over a vnode has an advisory lock set at close,
vn_closefile() acquires additional vnode use reference to prevent
freeing the vnode in vn_close(). Side effect is that for device
vnodes, devfs_close() sees that vnode reference count is greater than
one and refuses to call d_close(). Create internal version of
vn_close() which can avoid dropping the vnode reference if needed, and
use this to execute VOP_CLOSE() without acquiring a new reference.

Note that any parallel reference to the vnode would still prevent
d_close call, if the reference is not from an opened file, e.g. due to
stat(2).

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 7903b000 09-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not establish advisory locks when doing open(O_EXLOCK) or open(O_SHLOCK)
for files which do not have DTYPE_VNODE type.

Both flock(2) and fcntl(2) syscalls refuse to acquire advisory lock on
a file which type is not DTYPE_VNODE. Do the same when lock is
requested from open(2).

Restructure the block in vn_open_vnode() which handles O_EXLOCK and
O_SHLOCK open flags to make it easier to quit its execution earlier
with an error.

Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# f1f7f1cb 27-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

hwpmc: partially depessimize mmap handling if the module is not loaded

In particular this means the pmc sx lock is no longer taken when an
executable mapping succeeds.

MFC after: 1 week


# 25c68168 22-Jan-2017 Konstantin Belousov <kib@FreeBSD.org>

More style cleanup. Use ANSI C definition for vn_closefile(). Switch
to VNASSERT in _vn_lock(), simplify messages.

Sponsored by: The FreeBSD Foundation
X-MFC with: r312600, r312601, r312602, r312606


# eaf0969b 21-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix LK_RETRY logic braino in r312600


# 829857c8 21-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

vfs: __predict_false the need to handle F_HASLOCK

Also reorder the check with DTYPE_VNODE. Passed files are vnodes vast
majority of the time, so it is typically true.


# abbc538d 21-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix whitespace damage in r312600

While here wrap the previously overly long line so that it fits 80 chars.


# 1091fb52 21-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

vfs: refactor _vn_lock

Stop testing for LK_RETRY and error multiple times. Also postpone the
VI_DOOMED until after LK_RETRY was seen as it reads from the vnode.

No functional changes.


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# c3c0088b 20-Aug-2016 Robert Watson <rwatson@FreeBSD.org>

Audit additional vnode information in the implementation of the
ftruncate(2) system call. This was not required by the Common
Criteria, which needed only open-time audit.

MFC after: 2 weeks
Sponsored by: DARPA, AFRL


# af326ace 25-Jul-2016 Conrad Meyer <cem@FreeBSD.org>

devfs: Move most ioctl logic down to vnode layer

Devfs' file layer ioctl is now just a thin shim around the vnode layer.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7286


# 971711fb 05-Jul-2016 Robert Watson <rwatson@FreeBSD.org>

Call audit hooks to capture vnode attributes for three file-descriptor
method implementations: fstat(2), close(2), and poll(2). This change
synchronises auditing here with similar auditing for VFS-specific system
calls such as stat(2) that audit more complete vnode information.

Sponsored by: DARPA, AFRL
Approved by: re (kib)
MFC after: 1 week


# 3f7ca894 17-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS. This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 31b67320 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes.

Mostly on comments but affects some debug messages.

MFC after: 2 weeks


# 74b8d63d 10-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

Cleanup unnecessary semicolons from the kernel.

Found with devel/coccinelle.


# 6adf1948 22-Jan-2016 Konstantin Belousov <kib@FreeBSD.org>

The struct file f_advice member is overlaid with the devfs f_cdevpriv
data. If vnode bypass for devfs file failed, vn_read/vn_write are
called and might try to dereference f_advice. Limit the accesses to
f_advice to VREG vnodes only, which is the type ensured by
posix_fadvise().

The f_advice for regular files is protected by mtxpool lock. Recheck
that f_advice is not NULL after lock is taken.

Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# ce958bde 17-Jan-2016 Konstantin Belousov <kib@FreeBSD.org>

When cleaning up from failed adv locking and checking for write, do
not call VOP_CLOSE() manually. Instead, delegate the close to
fo_close() performed as part of the fdrop() on the file failed to
open. For this, finish constructing file on error, in particular, set
f_vnode and f_ops.

Forcibly resetting f_ops to badfileops disabled additional cleanups
performed by fo_close() for some file types, in this case it was noted
that cdevpriv data was corrupted. Since fo_close() call must be
enabled for some file types, it makes more sense to enable it for all
files opened through vn_open_cred().

In collaboration with: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 78e79434 08-Oct-2015 Fabien Thomas <fabient@FreeBSD.org>

Fix r283998 that broke mapin events for hwpmc.

Reviewed by: jhb
Sponsored by: Stormshield


# 3138cd36 30-Sep-2015 Mark Johnston <markj@FreeBSD.org>

As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written. At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached. To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released. POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3726


# 9e18c9eb 09-Sep-2015 Konstantin Belousov <kib@FreeBSD.org>

For open("name", O_DIRECTORY | O_CREAT), do not try to create the
named node, open(2) cannot create directories. But do allow the flag
combination to succeed if the directory already exists.

Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always
invalid for the same reason, since open(2) cannot create directory.

Note that there is an argument that O_DIRECTORY | O_CREAT should be
invalid always, regardless of the target directory existence or
O_EXCL. The current fix is conservative and allows the call to
succeed in the situation where it succeeded before the patch.

Reported by: Tom Ridge <freebsd@tom-ridge.com>
Reviewed by: rwatson
PR: 202892
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 14bdbaf2 03-Sep-2015 Conrad Meyer <cem@FreeBSD.org>

Detect badly behaved coredump note helpers

Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath. As vnodes may move around during dump, this is racy.

So:

- Detect badly behaved notes in putnote() and pad underfilled notes.

- Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
exercise the NT_PROCSTAT_FILES corruption. It simply picks random
lengths to expand or truncate paths to in fo_fill_kinfo_vnode().

- Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
disable kinfo packing for PROCSTAT_FILES notes. This should avoid
both FILES note corruption and truncation, even if filenames change,
at the cost of about 1 kiB in padding bloat per open fd. Document
the new sysctl in core.5.

- Fix note_procstat_files to self-limit in the 2nd pass. Since
sometimes this will result in a short write, pad up to our advertised
size. This addresses note corruption, at the risk of sometimes
truncating the last several fd info entries.

- Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
zero padding.

With suggestions from: bjk, jhb, kib, wblock
Approved by: markj (mentor)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3548


# 89177288 30-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

vn_io_fault() handling of the LOR for i/o into the file-backed buffers
has observable overhead when the buffer pages are not resident or not
mapped. The overhead comes at least from two factors, one is the
additional work needed to detect the situation, prepare and execute
the rollbacks. Another is the consequence of the i/o splitting into
the batches of the held pages, causing filesystems see series of the
smaller i/o requests instead of the single large request.

Note that expected case of the resident i/o buffer does not expose
these issues. Provide a prefaulting for the userspace i/o buffers,
disabled by default. I am careful of not enabling prefaulting by
default for now, since it would be detrimental for the applications
which speculatively pass extra-large buffers of anonymous memory to
not deal with buffer sizing (if such apps exist).

Found and tested by: bde, emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 5f34e93c 05-Jul-2015 Mark Johnston <markj@FreeBSD.org>

Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT.
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.

Reviewed by: kib, mjg
Differential Revision: https://reviews.freebsd.org/D2937


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 7077c426 04-Jun-2015 John Baldwin <jhb@FreeBSD.org>

Add a new file operations hook for mmap operations. File type-specific
logic is now placed in the mmap hook implementation rather than requiring
it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to
support mmap() as well as potentially allowing mmap() for existing file
types that do not currently support any mapping.

The vm_mmap() function is now split up into two functions. A new
vm_mmap_object() function handles the "back half" of vm_mmap() and accepts
a referenced VM object to map rather than a (handle, handle_type) tuple.
vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a
a VM object and then calling vm_mmap_object() to handle the actual mapping.
The vm_mmap() function remains for use by other parts of the kernel
(e.g. device drivers and exec) but now only supports mapping vnodes,
character devices, and anonymous memory.

The mmap() system call invokes vm_mmap_object() directly with a NULL object
for anonymous mappings. For mappings using a file descriptor, the
descriptors fo_mmap() hook is invoked instead. The fo_mmap() hook is
responsible for performing type-specific checks and adjustments to
arguments as well as possibly modifying mapping parameters such as flags
or the object offset. The fo_mmap() hook routines then call
vm_mmap_object() to handle the actual mapping.

The fo_mmap() hook is optional. If it is not set, then fo_mmap() will
fail with ENODEV. A fo_mmap() hook is implemented for regular files,
character devices, and shared memory objects (created via shm_open()).

While here, consistently use the VM_PROT_* constants for the vm_prot_t
type for the 'prot' variable passed to vm_mmap() and vm_mmap_object()
as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines.
Previously some places were using the mmap()-specific PROT_* constants
instead. While this happens to work because PROT_xx == VM_PROT_xx,
using VM_PROT_* is more correct.

Differential Revision: https://reviews.freebsd.org/D2658
Reviewed by: alc (glanced over), kib
MFC after: 1 month
Sponsored by: Chelsio


# 2db0e1f5 27-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Add V_MNTREF flag to the vn_start_write(9) and
vn_start_secondary_write(9) functions. The flag indicates that the
caller already owns a reference on the mount point, and the functions
can consume it. The reference is released by vn_finished_write(9) and
vn_finished_secondary_write(9) in due course.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d5fec489 21-Apr-2015 Craig Rodrigues <rodrigc@FreeBSD.org>

Support file verification in MAC.

* Add VCREAT flag to indicate when a new file is being created
* Add VVERIFY to indicate verification is required
* Both VCREAT and VVERIFY are only passed on the MAC method vnode_check_open
and are removed from the accmode after
* Add O_VERIFY flag to rtld open of objects
* Add 'v' flag to __sflags to set O_VERIFY flag.

Submitted by: Steve Kiernan <stevek@juniper.net>
Obtained from: Juniper Networks, Inc.
GitHub Pull Request: https://github.com/freebsd/freebsd/pull/27
Relnotes: yes


# 8ee9765a 21-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Add VN_OPEN_NAMECACHE flag for vn_open_cred(9), which requests that
the created file name was cached. Use the flag for core dumps.

Requested by: rpaulo
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 6c21f6ed 18-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter(). Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by: Chris Torek <torek@pi-coral.com>
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 0061ddb3 13-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Only sleep interruptible while waiting for suspension end when
filesystem specified VFCF_SBDRY flag, i.e. for NFS.

There are two issues with the sleeps. First, applications may get
unexpected EINTR from the disk i/o syscalls. Second, interruptible
sleep allows the stop of the process, and since mount point is
referenced while thread sleeps, unmount cannot free mount point
structure' memory, blocking unmount indefinitely.

Even for NFS, it is probably only reasonable to enable PCATCH for intr
mounts, but this information is currently not available at VFS level.

Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 5fab60a0 14-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

In vfs_write_suspend_umnt(), if suspension cannot be established, do
not forget to restore write ops count when returning the error.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4fce16e4 20-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Provide vfs suspension support only for filesystems which need it, take
two.

nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.

This is a fixup for 273271.

No strong objections from: kib
Pointy hat to: mjg
MFC after: 2 weeks


# 020b8f17 19-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Provide vfs suspension support only for filesystems which need it.

Need is expressed by providing vfs_susp_clean function in vfsops.

Differential Revision: D952
Reviewed by: kib (previous version)
MFC after: 2 weeks


# 4142462e 04-Oct-2014 Konstantin Belousov <kib@FreeBSD.org>

Slightly reword comment. Move code, which is described by the
comment, after it.

Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e3d6fece 04-Oct-2014 Konstantin Belousov <kib@FreeBSD.org>

Add IO_RANGELOCKED flag for vn_rdwr(9), which specifies that vnode is
not locked, but range is.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 9696feeb 22-Sep-2014 John Baldwin <jhb@FreeBSD.org>

Add a new fo_fill_kinfo fileops method to add type-specific information to
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
for each file and then convert that to a kinfo_ofile structure rather than
keeping a second, different set of code that directly manipulates
type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.

Differential Revision: https://reviews.freebsd.org/D775
Reviewed by: kib, glebius (earlier version)


# 037755fd 26-Aug-2014 Mateusz Guzik <mjg@FreeBSD.org>

Fix up races with f_seqcount handling.

It was possible that the kernel would overwrite user-supplied hint.

Abuse vnode lock for this purpose.

In collaboration with: kib
MFC after: 1 week


# 895b3782 14-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Extract the code to put a filesystem into the suspended state (at the
unmount time) in the helper vfs_write_suspend_umnt(). Use it instead
of two inline copies in FFS.

Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# a6945216 14-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Generalize vn_get_ino() to allow filesystems to use custom vnode
producer, instead of hard-coding VFS_VGET(). New function, which
takes callback, is called vn_get_ino_gen(), standard callback for
vn_get_ino() is provided.

Convert inline copies of vn_get_ino() in msdosfs and cd9660 into the
uses of vn_get_ino_gen().

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 7b81a399 17-Jun-2014 Konstantin Belousov <kib@FreeBSD.org>

In msdosfs_setattr(), add a check for result of the utimes(2)
permissions test, forgotten in r164033.

Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once. Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).

Reported by: bde
Reviewed by: bde, trasz
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 2e501b0a 14-Jun-2014 Konstantin Belousov <kib@FreeBSD.org>

Use vn_io_fault for the writes from core dumping code. Recursing into
VM due to copyin(9) faulting while VFS locks are held is
deadlock-prone there in the same way as for the write(2) syscall.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 6f2b769c 15-Mar-2014 John-Mark Gurney <jmg@FreeBSD.org>

change td_retval into a union w/ off_t, with defines to mask the
change... This eliminates a cast, and also forces td_retval
(often 2 32-bit registers) to be aligned so that off_t's can be
stored there on arches with strict alignment requirements like
armeb (AVILA)... On i386, this doesn't change alignment, and on
amd64 it doesn't either, as register_t is already 64bits...

This will also prevent future breakage due to people adding additional
fields to the struct...

This gets AVILA booting a bit farther...

Reviewed by: bde


# 65f05eeb 17-Dec-2013 Konstantin Belousov <kib@FreeBSD.org>

If vn_open_vnode() succeeded in opening the vnode, but subsequent
advisory lock cannot be obtained, prevent double-close of the vnode in
vn_close() called from the fdrop(), by resetting file' f_ops methods.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 7e14088d 20-Nov-2013 Konstantin Belousov <kib@FreeBSD.org>

Revert back to use int for the page counts. In vn_io_fault(), the i/o
is chunked to pieces limited by integer io_hold_cnt tunable, while
vm_fault_quick_hold_pages() takes integer max_count as the upper bound.

Rearrange the checks to correctly handle overflowing address arithmetic.

Submitted by: bde
Tested by: pho
Discussed with: alc
MFC after: 1 week


# d005ed53 12-Nov-2013 Konstantin Belousov <kib@FreeBSD.org>

Avoid overflow for the page counts.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 6272798a 09-Nov-2013 Konstantin Belousov <kib@FreeBSD.org>

Both vn_close() and VFS_PROLOGUE() evaluate vp->v_mount twice, without
holding the vnode lock; vp->v_mount is checked first for NULL
equiality, and then dereferenced if not NULL. If vnode is reclaimed
meantime, second dereference would still give NULL. Change
VFS_PROLOGUE() to evaluate the mp once, convert MNTK_SHARED_WRITES and
MNTK_EXTENDED_SHARED tests into inline functions.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 9bec6325 13-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

When opening or closing fifo, ensure that the vnode is locked
exclusively. Filesystems are assumed to disable shared locking for
the fifo vnode locks, but some do not.

Reported and tested by: olgeni
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (glebius)


# c0a46535 21-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Make the seek a method of the struct fileops.

Tested by: pho
Sponsored by: The FreeBSD Foundation


# b1dd38f4 16-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Restore the previous sendfile(2) behaviour on the block devices.
Provide valid .fo_sendfile method for several missed struct fileops.

Reviewed by: glebius
Sponsored by: The FreeBSD Foundation


# ca04d21d 15-Aug-2013 Gleb Smirnoff <glebius@FreeBSD.org>

Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# cc3d8c35 09-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

There are several code sequences like
vfs_busy(mp);
vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls. The unmount starts a write, while vfs_write_suspend() drain
writers. On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked. The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 3d4c503c 31-May-2013 John Baldwin <jhb@FreeBSD.org>

Style fixes to vn_ioctl().

Suggested by: bde


# dfa66c01 03-May-2013 John Baldwin <jhb@FreeBSD.org>

Fix FIONREAD on regular files. The computed result was being ignored and
it was being passed down to VOP_IOCTL() where it promptly resulted in
ENOTTY due to a missing else for the past 8 years. While here, use a
shared vnode lock while fetching the current file's size.

MFC after: 1 week


# 926cd204 30-Mar-2013 Matthew D Fleming <mdf@FreeBSD.org>

Use a shared lock for VOP_GETEXTATTR, as it is a read-like operation.

MFC after: 1 week


# aed5a114 15-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Separate the copyright lines and the informational block by a blank line.

Requested by: joel
MFC after: 2 weeks


# 5791cee8 14-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Add my copyright for the 2012 year work, in particular vn_io_fault()
and f_offset locking. Add required Foundation notice for r248319.

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 5f5f0554 15-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Implement the helper function vn_io_fault_pgmove(), intended to use by
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers. Helper
provides the convenient wrapper over the pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.

Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 71ac38e8 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove unnecessary variables.


# d7ffa248 15-Feb-2013 Sergey Kandaurov <pluknet@FreeBSD.org>

vn_io_faults_cnt:
- use u_long consistently
- use SYSCTL_ULONG to match the type of variable

Reviewed by: kib
MFC after: 1 week


# 9e2677fd 31-Jan-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Simplify code a bit. This is leftover after Giant removal from VFS.


# ddd6b3fc 10-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by: The FreeBSD Foundation


# f99cb34c 01-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that is must not be called while the snaplock is
owned. The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 91e94745 28-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount. The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock. If the suspension intervene between resume and
vn_start_write(), the deadlock occured after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: mckusick
MFC after: 2 weeks
X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC,
in HEAD


# f121e3e8 27-Nov-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Add NOCAPCHECK flag to namei that allows lookup to work even if the process
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() to will ne converted into
NOCAPCHECK namei flag.

This functionality will be used to enable core dumps for sandboxed processes.

Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks


# 140dedb8 02-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem. There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by: pho
MFC after: 3 weeks


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 877d24ac 28-Sep-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix the mis-handling of the VV_TEXT on the nullfs vnodes.

If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by: pho (previous version)
MFC after: 2 weeks


# 8c706ce0 25-Sep-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

vn_write() always expects FOF_OFFSET flag, which is asserted at the begining,
so there is no need to check for it.

Sponsored by: FreeBSD Foundation
MFC after: 2 weeks


# e838f09c 31-Jul-2012 John Baldwin <jhb@FreeBSD.org>

Reorder the managament of advisory locks on open files so that the advisory
lock is obtained before the write count is increased during open() and the
lock is released after the write count is decreased during close().

The first change closes a race where an open() that will block with O_SHLOCK
or O_EXLOCK can increase the write count while it waits. If the process
holding the current lock on the file then tries to call exec() on the file
it has locked, it can fail with ETXTBUSY even though the advisory lock is
preventing other threads from succesfully completeing a writable open().

The second change closes a race where a read-only open() with O_SHLOCK or
O_EXLOCK may return successfully while the write count is non-zero due to
another descriptor that had the advisory lock and was blocking the open()
still being in the process of closing. If the process that completed the
open() then attempts to call exec() on the file it locked, it can fail with
ETXTBUSY even though the other process that held a write lock has closed
the file and released the lock.

Reviewed by: kib
MFC after: 1 month


# c5c1199c 02-Jul-2012 Konstantin Belousov <kib@FreeBSD.org>

Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by: pho
No objections from: jhb
MFC after: 3 weeks


# 854c3ce7 21-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix locking for f_offset, vn_read() and vn_write() cases only, for now.

It seems that intended locking protocol for struct file f_offset field
was as follows: f_offset should always be changed under the vnode lock
(except fcntl(2) and lseek(2) did not followed the rules). Since
read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally
taken to serialize shared vnode lock owners.

This was broken first by enabling shared lock on writes, then by
fadvise changes, which moved f_offset assigned from under vnode lock,
and last by vn_io_fault() doing chunked i/o. More, due to uio_offset
not yet valid in vn_io_fault(), the range lock for reads was taken on
the wrong region.

Change the locking for f_offset to always use FOFFSET_LOCKED block,
which is placed before rangelocks in the lock order.

Extract foffset_lock() and foffset_unlock() functions which implements
FOFFSET_LOCKED lock, and consistently lock f_offset with it in the
vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is
not set for the vnode mount. Indicate that f_offset is already valid
for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET
flag, and assert that all callers of vn_read() and vn_write() follow
this protocol.

Extract get_advice() function to calculate the POSIX_FADV_XXX value
for the i/o region, and use it were appropriate.

Reviewed by: jhb
Tested by: pho
MFC after: 2 weeks


# cd4ecf3c 19-Jun-2012 John Baldwin <jhb@FreeBSD.org>

Further refine the implementation of POSIX_FADV_NOREUSE.

First, extend the changes in r230782 to better handle the common case
of using NOREUSE with sequential reads. A NOREUSE file descriptor
will now track the last implicit DONTNEED request it made as a result
of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the
previous range, it will apply the DONTNEED request to the entire range
of both the previous read and the current read. The effect is that
each read of a file accessed sequentially will apply the DONTNEED
request to the entire range that has been read. This allows NOREUSE
to properly handle misaligned reads by flushing each buffer to cache
once it has been completely read.

Second, apply the same changes made to read(2) by r230782 and this
change to writes. This provides much better performance in the
sequential write case as it allows writes to still be clustered. It
also provides much better performance for misaligned writes. It does
mean that NOREUSE will be generally ineffective for non-sequential
writes as the current implementation relies on a future NOREUSE
write's implicit DONTNEED request to flush the dirty buffer from the
current write.

MFC after: 2 weeks


# 7ac1b61a 08-Jun-2012 John Baldwin <jhb@FreeBSD.org>

Split the second half of vn_open_cred() (after a vnode has been found via
a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function
and use this function in fhopen() instead of duplicating code from
vn_open_cred() directly.

Tested by: pho
Reviewed by: kib
MFC after: 2 weeks


# bba08085 03-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a knob to disable vn_io_fault.

MFC after: 1 month


# bb2f52a6 03-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Count and export the number of prefaulting happen.

MFC after: 1 month


# 41014d99 30-May-2012 Konstantin Belousov <kib@FreeBSD.org>

vn_io_fault() is a facility to prevent page faults while filesystems
perform copyin/copyout of the file data into the usermode
buffer. Typical filesystem hold vnode lock and some buffer locks over
the VOP_READ() and VOP_WRITE() operations, and since page fault
handler may need to recurse into VFS to get the page content, a
deadlock is possible.

The facility works by disabling page faults handling for the current
thread and attempting to execute i/o while allowing uiomove() to
access the usermode mapping of the i/o buffer. If all buffer pages are
resident, uiomove() is successfull and request is finished. If EFAULT
is returned from uiomove(), the pages backing i/o buffer are faulted
in and held, and the copyin/out is performed using uiomove_fromphys()
over the held pages for the second attempt of VOP call.

Since pages are hold in chunks to prevent large i/o requests from
starving free pages pool, and since vnode lock is only taken for
i/o over the current chunk, the vnode lock no longer protect atomicity
of the whole i/o request. Use newly added rangelocks to provide the
required atomicity of i/o regardind other i/o and truncations.

Filesystems need to explicitely opt-in into the scheme, by setting the
MNTK_NO_IOPF struct mount flag, and optionally by using
vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or
converting uio into request for uiomove_fromphys().

Reviewed by: bf (comments), mdf, pjd (previous version)
Tested by: pho
Tested by: flo, Gustau P?rez <gperez entel upc edu> (previous version)
MFC after: 2 months


# 292520f7 25-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a vn_bmap_seekhole(9) vnode helper which can be used by any
filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA
commands for lseek(2).

MFC after: 2 weeks


# b47f6241 08-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Add KTR_VFS traces to track modifications to a vnode's writecount.


# 526d0bd5 20-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# 2bd3e4c2 30-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Refine the implementation of POSIX_FADV_NOREUSE for the read(2) case such
that instead of using direct I/O it allows read-ahead similar to
POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the
read(2) has completed to purge just-read data. The write(2) path continues
to use direct I/O for POSIX_FADV_NOREUSE for now. Note that NOREUSE works
optimally if an application reads and writes full fs blocks.


# 936c09ac 03-Nov-2011 John Baldwin <jhb@FreeBSD.org>

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 82378711 25-Aug-2011 Martin Matuska <mm@FreeBSD.org>

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


# 9c00bb91 16-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)


# 3d08a76b 12-May-2011 Matthew D Fleming <mdf@FreeBSD.org>

Use a name instead of a magic number for kern_yield(9) when the priority
should not change. Fetch the td_user_pri under the thread lock. This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >


# e7ceb1e9 07-Feb-2011 Matthew D Fleming <mdf@FreeBSD.org>

Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yeild() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 3297cdd0 26-Jun-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Correct arguments order.


# 77dda2b9 06-May-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Avoid overflow.

Submitted by: bde@


# 307d88b7 06-May-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Style fixes and removal of unneeded variable.

Submitted by: bde@


# b5f770bd 05-May-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().

Reviewed by: kib


# 0866329b 17-Apr-2010 Andriy Gapon <avg@FreeBSD.org>

MFC r206129: vn_stat: use va_blocksize when setting st_blksize


# 364b8a7b 03-Apr-2010 Andriy Gapon <avg@FreeBSD.org>

vn_stat: take into account va_blocksize when setting st_blksize

As currently st_blksize is always PAGE_SIZE, it is playing safe to not
use any smaller value. For some cases this might not be optimal, but
at least nothing should get broken.

Generally I don't expect this commit to change much for the following
reasons (in case of VREG, VDIR):
- application I/O and physical I/O are sufficiently decoupled by
filesystem code, buffer cache code, cluster and read-ahead logic
- not all applications use st_blksize as a hint, some use f_iosize, some
use fixed block sizes

I expect writes to the middle of files on ZFS to benefit the most from
this change.

Silence from: fs@
MFC after: 2 weeks


# 510ea843 28-Mar-2010 Ed Schouten <ed@FreeBSD.org>

Rename st_*timespec fields to st_*tim for POSIX 2008 compliance.

A nice thing about POSIX 2008 is that it finally standardizes a way to
obtain file access/modification/change times in sub-second precision,
namely using struct timespec, which we already have for a very long
time. Unfortunately POSIX uses different names.

This commit adds compatibility macros, so existing code should still
build properly. Also change all source code in the kernel to work
without any of the compatibility macros. This makes it all a less
ambiguous.

I am also renaming st_birthtime to st_birthtim, even though it was a
local extension anyway. It seems Cygwin also has a st_birthtim.


# bf876fcd 27-Mar-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

MFC r200273:

Don't add VAPPEND if the file is not being opened for writing. Note that this
only affects cases where open(2) is being used improperly - i.e. when the user
specifies O_APPEND without O_WRONLY or O_RDWR.

Reviewed by: rwatson


# 0fef797f 21-Mar-2010 Ed Schouten <ed@FreeBSD.org>

Actually make O_DIRECTORY work.

According to POSIX open() must return ENOTDIR when the path name does
not refer to a path name. Change vn_open() to respect this flag. This
also simplifies the Linuxolator a bit.


# 9d7031a6 08-Dec-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Don't add VAPPEND if the file is not being opened for writing. Note that this
only affects cases where open(2) is being used improperly - i.e. when the user
specifies O_APPEND without O_WRONLY or O_RDWR.

Reviewed by: rwatson


# 931d1367 07-Dec-2009 Xin LI <delphij@FreeBSD.org>

MFC revision 197579 and 199617:

Add two new fcntls to enable/disable read-ahead:

- F_READAHEAD: specify the amount for sequential access. The amount is
specified in bytes and is rounded up to nearest block size.
- F_RDAHEAD: Darwin compatible version that use 128KB as the sequential
access size.

A third argument of zero disables the read-ahead behavior.

Please note that the read-ahead amount is also constrainted by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.

Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.


# 3b1b0980 04-Nov-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Revert r198874, pending further discussion.


# 8fafa5ce 03-Nov-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make sure we don't end up with VAPPEND without VWRITE, if someone calls open(2)
like this: open(..., O_APPEND).


# 82aebf69 28-Sep-2009 Xin LI <delphij@FreeBSD.org>

Add two new fcntls to enable/disable read-ahead:

- F_READAHEAD: specify the amount for sequential access. The amount is
specified in bytes and is rounded up to nearest block size.
- F_RDAHEAD: Darwin compatible version that use 128KB as the sequential
access size.

A third argument of zero disables the read-ahead behavior.

Please note that the read-ahead amount is also constrainted by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.

Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.

Submitted by: Igor Sysoev <is rambler-co ru>
Reviewed by: kib@
MFC after: 1 month


# c02280f5 08-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r196692:
Make the mnt_writeopcount and mnt_secondary_writes counters,
used by the suspension code, not greater then mnt_ref reference
counter value.

MFC r196733:
Fix mount reference leak when V_XSLEEP is specified to vn_start_write().

Approved by: re (kensmith)


# 579b9760 31-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Fix mount reference leak when V_XSLEEP is specified to vn_start_write().

Submitted by: tegge


# a505c2c7 31-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Make the mnt_writeopcount and mnt_secondary_writes counters,
used by the suspension code, not greater then mnt_ref reference
counter value. Increment mnt_ref together with write counter
in vn_start_write()/ vn_start_secondary_write(), releasing in
vn_finished_write/vn_finished_secondary_write().

Since r186197, unmount code requires that no writers occured after all
references are expired. We still could get write counter incremented
for freed or reused struct mount, but it seems to be innocent, since
corresponding vnode should be referenced and reclaimed then.

Reported by: pho (last half a year), erwin
Reviewed by: attilio
Tested by: pho, erwin
MFC after: 1 week


# f1eccd05 02-Jul-2009 Konstantin Belousov <kib@FreeBSD.org>

In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by: pho
Reviewed by: jeff
Approved by: re (kensmith)
MFC after: 2 weeks


# e0c161b8 21-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson


# 27bfb741 08-Jun-2009 Paul Saab <ps@FreeBSD.org>

Simply shared vnode locking and extend it to also include fsync.
Also, in vop_write, no longer assert for exclusive locks on the
vnode.

Reviewed by: jhb, kmacy, jeffr


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 983b43c5 04-Jun-2009 Paul Saab <ps@FreeBSD.org>

When checking for shared writes, use the struct mount returned from
vn_start_write.

Reviewed by: jhb


# a6d545d8 04-Jun-2009 Paul Saab <ps@FreeBSD.org>

Support shared vnode locks for write operations when the offset is
provided on filesystems that support it. This really improves mysql
+ innodb performance on ZFS.

Reviewed by: jhb, kmacy, jeffr


# dfd233ed 11-May-2009 Attilio Rao <attilio@FreeBSD.org>

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


# 41b72e6e 07-May-2009 Konstantin Belousov <kib@FreeBSD.org>

Eliminate the loop and the call to pause(9) in vfs_vget_ino(). If
vfs_busy(MBF_NOWAIT) failed, unlock the vnode and sleep in vfs_busy().

Suggested and reviewed by: jeff
Tested by: pho
MFC after: 3 weeks


# 5e6a9266 13-Apr-2009 Kip Macy <kmacy@FreeBSD.org>

- use a shared lock for reads
- remove stale comment

Reviewed by: jeffr


# 885868cd 10-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon


# 33fc3625 11-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month


# e9aff357 21-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by: jhb
Tested by: pho
MFC after: 1 month


# 0c6a80e7 28-Nov-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Improve KASSERT() call a bit:
- Print flags in hex.
- Note that flags can be fine and panic can be due unexpected error condition.
- Remove redundant new line character.

Eventhough panic message excess 80 characters keep it in one line so it is
easier to grep.


# c5f77bf9 16-Nov-2008 Konstantin Belousov <kib@FreeBSD.org>

Revert r184118. There is actually a code in the kernel, for instance in
kern_unlinkat(), that expects that vn_start_write() actually fills the mp
even when the call failed.

As Tor noted, that pattern relies on the the type stability of the mount
points, as well as that suspended mount points are never freed and
V_XSLEEP is always passed to vn_start_write() when called on a freed
mount point.

Reported by: stass
Reviewed by: tegge
PR: 123768


# 21fc02d2 03-Nov-2008 John Baldwin <jhb@FreeBSD.org>

Use shared vnode locks instead of exclusive vnode locks for the access(),
chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek()
(when figuring out the current size of the file in the SEEK_END case),
pathconf(), readlink(), and statfs() system calls.

Submitted by: ups (mostly)
Tested by: pho
MFC after: 1 month


# 83b3bdbc 02-Nov-2008 Attilio Rao <attilio@FreeBSD.org>

Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho


# 15bc6b2b 28-Oct-2008 Edward Tomasz Napierala <trasz@FreeBSD.org>

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


# 3ba28ace 21-Oct-2008 Konstantin Belousov <kib@FreeBSD.org>

Change vn_start_write() to clear *mpp on all failures when non-NULL vp
is supplied, since vm_pageout_scan() expects it to be cleared on error.

Submitted by: tegge
PR: 123768
MFC after: 1 week


# 016f98f9 20-Oct-2008 Konstantin Belousov <kib@FreeBSD.org>

Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock
and ffs_lock. This cannot catch situations where holdcnt is incremented
not by curthread, but I think it is useful.

Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 0fbbf2ea 20-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Initialize va_rdev to NODEV and va_fsid to VNOVAL before the
VOP_GETATTR() call in vn_stat(). Thus if a file system doesn't
initialize those fields in VOP_GETATTR() they will have a sane default
value.

Submitted by: Jaakko Heinonen <jh saunalahti fi>
Discussed on: freebsd-fs
MFC after: 1 month


# ea60a5f5 20-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Initialize birthtime fields in vn_stat() to prevent stat(2) from
returning uninitialized birthtime. Most file systems don't initialize
birthtime properly in their VOP_GETTATTR().

Submitted by: Jaakko Heinonen <jh saunalahti fi>
Reviewed by: bde
Discussed on: freebsd-fs
MFC after: 1 month


# 2814d5ba 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

When attempt is made to suspend a filesystem that is already syspended,
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.

Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.

Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.

Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month


# bdb80947 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Garbage-collect vn_write_suspend_wait().

Suggested and reviewed by: tegge
Tested by: pho
MFC after: 1 month


# 0359a12e 28-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 1d986c5f 03-Aug-2008 Robert Watson <rwatson@FreeBSD.org>

Remove broken code to replace st_mode value with ACCESSPERMS when
lstat(2) is called on symlinks -- this code appears never to have
worked. The PR this addresses suggests that the intended
original behavior is the right one, but as bde points out in the
PR comments, we do actually support storing a mode on symlinks,
so returning it seems reasonable.

This is consistent with Mac OS X, which despite documentation to
the contrary does return the mode set on a symlink, but not some
other platforms. The Single Unix Spec requires only that the
returned bits be "meaningful", which seems at best unhelpful as
advice goes.

PR: 25018
MFC after: 3 days


# e314f69f 31-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the support for the O_EXEC open(2) mode, as specified by the
POSIX Extended API Set Part 2 extension specification.

Reviewed by: rwatson, rdivacky
Tested by: pho


# 5634d486 29-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Don't allow calls to vn_lock() with no lock type requested. Callers
which simply want a reference should use vref(). Callers which want
to check validity need to hold a lock while performing any action
based on that validity. vn_lock() would always release the interlock
before returning making any action synchronous with the validity check
impossible.


# 804e60d4 23-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Don't acquire the vnode interlock in _vn_lock() unless no lock type
is requested. Handle this case specially before the while loop.
- Use the held vnode lock to check for VI_DOOMED. The vnode lock and
interlock must both be held to set VI_DOOMED so either one held, even
shared, is sufficient to check it.

No objection by: kib


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# e4650294 07-Jan-2008 John Baldwin <jhb@FreeBSD.org>

Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by: rwatson


# 92838485 05-Jan-2008 Bruce Evans <bde@FreeBSD.org>

In sequential_heuristic():
- spell 16384 as 16384 and not as BKVASIZE. 16384 is (not quite) just a
magic size that works well in practice. BKVASIZE should be MAXBSIZE
(65536), but is 16384 because i386's don't have enough kva for it to
be MAXBSIZE; 16384 works (not so well) for it for much the same reasons
that it works well in the heuristic.
- expand and/or add comments about this and other details.
- don't explicitly inline this function.
- fix some other style bugs.


# 397c19d1 29-Dec-2007 Jeff Roberson <jeff@FreeBSD.org>

Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
in such a way that the ops vector is only valid after the data, type,
and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires
the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by: kris, pho


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 57fd3d55 26-Jul-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

When we do open, we should lock the vnode exclusively. This fixes few races:
- fifo race, where two threads assign v_fifoinfo,
- v_writecount modifications,
- v_object modifications,
- and probably more...

Discussed with: kib, ups
Approved by: re (rwatson)


# 9e223287 31-May-2007 Konstantin Belousov <kib@FreeBSD.org>

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


# d413d210 18-May-2007 Konstantin Belousov <kib@FreeBSD.org>

Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.


# c6b342f8 17-May-2007 Peter Wemm <peter@FreeBSD.org>

Eliminate a micro-optimization that hasn't had any effect for 15+ years.


# 87aabdc1 12-Feb-2007 Mike Pritchard <mpp@FreeBSD.org>

Add a VNASSERT to vn_close to detect if v_writecount is going
to become negative. This will detect the underflow when it
happens, instead of having it discovered when the vnode is
taken off the freelist, long after the offending process is long
gone.


# 2f6a774b 12-Nov-2006 Kip Macy <kmacy@FreeBSD.org>

change vop_lock handling to allowing tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 92c08499 24-Jun-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Simplify the code and remove two mutex operations.

MFC after: 2 weeks


# 6befa6ae 16-May-2006 Paul Saab <ps@FreeBSD.org>

Allow concurrent read(2)/readv(2) access to a file.
Lock file offset against multiple read calls.

Submitted by: ups
Obtained from: Yahoo!
MFC after: 2 weeks


# 122410ee 28-Apr-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

vn_start_write() is called only when v_type != VCHR, so corresponding
vn_finished_write() should also be called only then.

BTW. I fixed two functions here: vn_rdwr() and vn_write(). The latter seems
to be unused.

MFC after: 3 weeks


# 3bbd6d8a 30-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Release the references acquired by VOP_GETWRITEMOUNT and vfs_getvfs().

Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.


# 861dab08 28-Mar-2006 John Baldwin <jhb@FreeBSD.org>

Change vn_open() to honor the MPSAFE flag in the passed in nameidata object
and use that instead of testing fdidx against -1 to determine if it should
release Giant if Giant was locked due to the requested file residing on a
non-MPSAFE VFS.

Discussed with: jeff


# bacb51fb 21-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Remove explicit giant acquires and replace it with VFS_LOCK_GIANT.

Sponsored by: Isilon Systems, Inc.


# a19fd0e7 11-Mar-2006 Christian S.J. Peron <csjp@FreeBSD.org>

Make sure that we are adding a path token to the audit record in open(2).
Do this by making sure we are using the AUDITVNODE1 mask in the namei flags.

Obtained from: TrustedBSD Project


# ca2fa807 10-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Block secondary writes while expunging active unlinked files.

Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.

Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).

Perform deferred inactive handling after the file system is resumed.


# 791dd2fa 08-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


# 0430a5e2 13-Dec-2005 Dag-Erling Smørgrav <des@FreeBSD.org>

Eradicate caddr_t from the VFS API.


# e8ddb61d 02-Aug-2005 Jeff Roberson <jeff@FreeBSD.org>

- Replace the series of DEBUG_LOCKS hacks which tried to save the vn_lock
caller by saving the stack of the last locker/unlocker in lockmgr. We
also put the stack in KTR at the moment.

Contributed by: Antoine Brodin <antoine.brodin@laposte.net>


# dbb3ec5c 13-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove vnode lock asserts at the end of vfs syscalls. These asserts were
used to ensure that we weren't exiting the syscall with a lock still
held. This wasn't safe, however, because we'd already executed a vput()
and on a loaded system the vnode may have been free'd by the time we
assert. This functionality is also handled by the td_locks assert in
userret, which doesn't tell you what the syscall was, but will at least
panic before you deadlock.

Sponsored by: Isilon Systems, Inc.
Discovred by: Peter Holm
Approved by: re (blanket vfs)


# d598b04d 12-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- It has long been my suspicion that we don't actually need a loop in
vn_lock(). Add an assert that will help me gain more confidence that this
is correct.

Sponsored by: Isilon Systems, Inc.


# 54981733 27-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Stop checking vxthread, we've asserted that it was useless for several
weeks.


# 7625cbf3 27-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Pass the ISOPEN flag to namei so filesystems will know we're about to
open them or otherwise access the data.


# 1b19c74d 11-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Assert that we're no longer doing recursive vn_locks in inactive/reclaim
as I'd like to get rid of the vxthread.
- Handle lock requests which don't actually want a lock as this is a
much more convenient place to handle this condition than in vget().
These requests simply want to know that VI_DOOMED isn't set.
- Correct a test at the end of vn_lock, if error !=0 should be
if error == 0, this has been broken since I comitted the VI_DOOMED
changes, but no one ran into it because vget() duplicated this
functionality.

Sponsored by: Isilon Systems, Inc.


# f3e89267 04-Apr-2005 Christian S.J. Peron <csjp@FreeBSD.org>

Assert that the vnode is locked. This is meant to catch bugs or
mis-use of the vnode API in conditions where IO_NODELOCKED has been
used without the vnode actually being locked.


# f247a524 30-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


# 3e6bcad3 23-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove some long dead LOOKUP_SHARED code that tracked the lock state.
- Always pass LOCKSHARED and rely on namei() to ignore it when
LOOKUP_SHARED is not set.

Sponsored by: Isilon Systems, Inc.


# 0463dc9e 13-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Do a vn_start_write in vn_close, we may write if this is the last ref
on an unlinked file. We can't know if this is the case until after we
have the lock.
- Lock the vnode in vn_close, many filesystems had code which was unsafe
without the lock held, and holding it greatly simplifies vgone().
- Adjust vn_lock() to check for the VI_DOOMED flag where appropriate.

Sponsored by: Isilon Systems, Inc.


# cd138194 23-Feb-2005 Christian S.J. Peron <csjp@FreeBSD.org>

Add locking assertions into vn_extattr_set, vn_extattr_get and
vn_extattr_rm. This is meant to catch conditions where IO_NODELOCKED
has been specified without the vnode being locked.

Discussed with: rwatson
MFC after: 1 week


# 4d8ac58b 17-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vx_wait{l}() and use it instead of home-rolled versions.


# dcff5b14 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't call VOP_CREATEVOBJECT(), it's the responsibility of the
filesystem which owns the vnode.


# f50a2d5e 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove GIANT_REQUIRED where giant is no longer required.
- Protect access to mnt_kern_flag with the mountpoint mutex.
- Use the appropriate nd flags to deal with giant in vn_open_cred().
We currently determine whether the caller is mpsafe by checking
for a valid fdidx. Any caller coming from user-space is now
mpsafe and supplies a valid fd. No kenrel callers have been
converted to mpsafe, so this check is sufficient for now.
- Use VFS_LOCK_GIANT instead of manual giant acquisition where
appropriate.

Sponsored By: Isilon Systems, Inc.


# e39db32a 12-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
directly.


# 8df6bac4 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# 18dc7373 18-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Ok, first blunder: ioctls are not entirely unused on vnodes anymore :-)

Add dropped call to VOP_IOCTL().


# a0fbccc9 17-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Push Giant down through ioctl.

Don't grab Giant in the upper syscall/wrapper code

NET_LOCK_GIANT in the socket code (sockets/fifos).

mtx_lock(&Giant) in the vnode code.

mtx_lock(&Giant) in the opencrypto code. (This may actually not be
needed, but better safe than sorry).

Devfs grabs Giant if the driver is marked as needing Giant.


# db446e30 17-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Push Giant down through select and poll.

Don't grab Giant in the upper syscall/wrapper code

NET_LOCK_GIANT in the socket code (sockets/fifos).

mtx_lock(&Giant) in the vnode code.

Devfs grabs Giant if the driver is marked as needing Giant.


# f6083975 15-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Give vn_poll single exit point (to make it easier to insert
"mtx_unlock(&Giant)" real soon now).


# c5b846fe 10-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Slim vnodes by another four bytes by eliminating the (now) unused field
v_cachedid.


# b797084e 09-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vnode->v_cachedfs.

It was only used for the highly dangerous "export all vnodes with a sysctl"
function.


# 5d9d81e7 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.


# 9b7cc97f 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused si_bsize_best field from struct cdev.


# 6e8d4202 24-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Hold dev_lock and check for NULL devsw pointer when we service FIODTYPE ioctl.


# 90a660e1 21-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

If a vnode has no v_rdev we cannot hope to answer FIODTYPE ioctl.


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# db532b63 06-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Flag a broad range of VFS operations as GIANT_REQUIRED in order to
catch leaking into VFS without Giant.

Inch Giant a little lower in several file descriptor operations on
vnodes to cover only VFS operations that need it, rather than file
flag reading, etc.


# a6719c82 22-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Push Giant acquisition down into fo_stat() from most callers. Acquire
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op. Don't acquire Giant in fstat() system call.

Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.


# 1c1ce925 22-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Push acquisition of Giant from fdrop_closed() into fo_close() so that
individual file object implementations can optionally acquire Giant if
they require it:

- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)

Notes:

Giant is still acquired in close() even when closing MPSAFE objects
due to kqueue requiring Giant in the calling closef() code.
Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
of pipe create/destroy pairs from user space with SMP compiled into
the kernel.

The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
test it extensively and so have left Giant over fo_close(). It can
probably be removed given some testing and review.


# 32240d08 10-Jul-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Update for the KDB framework:
o Call kdb_enter() instead of Debugger().


# f99619a0 04-Jun-2004 Tim J. Robbins <tjr@FreeBSD.org>

Change the types of vn_rdwr_inchunks()'s len and aresid arguments to
size_t and size_t *, respectively. Update callers for the new interface.
This is a better fix for overflows that occurred when dumping segments
larger than 2GB to core files.


# f3d055b6 01-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Rather than assert f_type==DTYPE_VNODE, conditionally perform the
file lock release based on f_type==DTYPE_VNODE. vn_closefile() is
used by non-vnode types as well (fifo).


# 63732dce 01-Jun-2004 Robert Watson <rwatson@FreeBSD.org>

Push the VOP_ADVLOCK() call to release advisory locks on vnode file
descriptors out of fdrop_locked() and into vn_closefile(). This
removes all knowledge of vnodes from fdrop_locked(), since the lock
behavior was specific to vnodes. This also removes the specific
requirement for Giant in fdrop_locked(), it's now only required by
code that it calls into.

Add GIANT_REQUIRED to vn_closefile() since VFS requires Giant.


# e79962db 31-May-2004 Robert Watson <rwatson@FreeBSD.org>

Assert Giant in vn_start_write() and vn_finished_write().


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 0249823e 12-Mar-2004 Bruce Evans <bde@FreeBSD.org>

Align the offset in vn_rdwr_inchunks() so that at most the first and
the last chunk are misaligned relative to a MAXBSIZE byte boundary.
vn_rdwr_inchunks() is used mainly for elf core dumps, and elf sections
are usually perfectly misaligned relative to MAXBSIZE, and chunking
prevents the file system from doing much realigning.

This gives a surprisingly large speedup for core dumps -- from 50 to
13 seconds for a 512MB core dump here. The pessimization was mostly
from an interaction of the misalignment with IO_DIRECT. It increased
the number of i/o's for each chunk by a factor of 5 (3 writes and 2
read-before-writes instead of 1 write).


# 9efe7d9d 28-Dec-2003 Bruce Evans <bde@FreeBSD.org>

v_vxproc was a bogus name for a thread (pointer).


# 1de1f935 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- If we are called with LK_NOWAIT in vn_lock() we may be holding a mutex
and should not sleep while waiting for XLOCK to clear. Care needs to be
taken in functions that use this capability to avoid spinning.


# 9080ff25 28-Jul-2003 Robert Watson <rwatson@FreeBSD.org>

Rename VOP_RMEXTATTR() to VOP_DELETEEXTATTR() for consistency with the
kernel ACL interfaces and system call names.

Break out UFS2 and FFS extattr delete and list vnode operations from
setextattr and getextattr to deleteextattr and listextattr, which
cleans up the implementations, and makes the results more readable,
and makes the APIs more clear.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 3ab6b09c 27-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Pass the fdidx argument from vn_open{_cred}() onto VOP_OPEN()


# 7c89f162 27-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.


# a8d43c90 26-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a "int fd" argument to VOP_OPEN() which in the future will
contain the filedescriptor number on opens from userland.

The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*

For now pass -1 all over the place.


# 6b42f0a2 22-Jun-2003 Robert Watson <rwatson@FreeBSD.org>

Prefer the vop_rmextattr() vnode operation for removing extended
attributes from objects over vop_setextattr() with a NULL uio; if
the file system doesn't support the vop_rmextattr() method, fall
back to the vop_setextattr() method.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 3b6d9652 22-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a f_vnode field to struct file.

Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.


# 2db4b023 18-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce a new flag on a file descriptor: DFLAG_SEEKABLE and use that
rather than assume that only DTYPE_VNODE is seekable.


# 7c2d2efd 18-Jun-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize struct fileops with C99 sparse initialization.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 0b955134 03-Jun-2003 Robert Watson <rwatson@FreeBSD.org>

Assert the vnode lock when returning successfully from vn_open_cred().


# 104a9b7e 29-Apr-2003 Alexander Kabaev <kan@FreeBSD.org>

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 128a0bb7 26-Mar-2003 Tor Egge <tegge@FreeBSD.org>

fp->f_offset doesn't need any protection when it isn't accessed.


# e7d6662f 14-Feb-2003 Alfred Perlstein <alfred@FreeBSD.org>

Do not allow kqueues to be passed via unix domain sockets.


# 48e3128b 12-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# cd72f218 11-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# c579babe 07-Jan-2003 Brian Feldman <green@FreeBSD.org>

In vn_open(), unset ndp->ni_vp when returning failure so that code
which expects it to be NULL unless the return value was 0 will work.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 45587e25 28-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Abstract-out the constants for the sequential heuristic.

No operational changes.

MFC after: 1 day


# a7010ee2 24-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

White-space changes.


# f3a68211 23-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Detediousficate declaration of fileops array members by introducing
typedefs for them.


# 9ab73fd1 24-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.


# 89c61753 19-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Drop in the MAC check for file creation as part of open().

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 3c275c19 26-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Under DIAGNOSTIC, complain if ENOIOCTL leaks out through VOP_IOCTL().


# 93b0017f 25-Aug-2002 Philippe Charnier <charnier@FreeBSD.org>

Replace various spelling with FALLTHROUGH which is lint()able


# ad32f726 22-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Fix a mistake in my last few commits. The PDROP flag stops msleep from
re-acquiring the mutex.

Pointy hat to: me
Noticed by: tegge


# 4b6049ca 22-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Closer inspection revealed a possible deadlock situation in vn_lock() that
was introduced by my last commit but not caught by stress testing. Fix
that and slightly restructure the code so that it is more readable.


# 9abf54f0 22-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Make vn_lock() vget() and VOP_LOCK() all behave the same way WRT
LK_INTERLOCK. The interlock will never be held on return from these
functions even when there is an error. Errors typically only occur when
the XLOCK is held which means this isn't the vnode we want anyway. Almost
all users of these interfaces expected this behavior even though it was
not provided before.


# 510939d0 22-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Return two shared locks to exclusive locks. This was premature.
- Document the problems that prevent us from using shared locks.


# 6c54a1f5 22-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Fix interlock handling in vn_lock(). Previously, vn_lock() could return
with interlock held in error conditions when the caller did not specify
LK_INTERLOCK.
- Add several comments to vn_lock() describing the rational behind the code
flow since it was not immediately obvious.


# 0b600db4 21-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Document two cases, one in vget and the other in vn_lock, where the state
of interlock on exit is not consistent. There are probably several bugs
relating to this.


# 177142e4 19-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Pass active_cred and file_cred into the MAC framework explicitly
for mac_check_vnode_{poll,read,stat,write}(). Pass in fp->f_cred
when calling these checks with a struct file available. Otherwise,
pass NOCRED. All currently MAC policies use active_cred, but
could now offer the cached credential semantic used for the base
system security model.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 7f724f8b 19-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Break out mac_check_vnode_op() into three seperate checks:
mac_check_vnode_poll(), mac_check_vnode_read(), mac_check_vnode_write().
This improves the consistency with other existing vnode checks, and
allows policies to avoid implementing switch statements to determine
what operations they do and do not want to authorize.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# d49fa1ca 16-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In continuation of early fileop credential changes, modify fo_ioctl() to
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.

- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# ea6027a8 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 9ca43589 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 0231c03d 12-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Implement IO_NOMACCHECK in vn_rdwr() -- perform MAC checks (assuming
'options MAC') as long as IO_NOMACCHECK is not set in the IO flags.
If IO_NOMACCHECK is set, bypass MAC checks in vn_rdwr(). This allows
vn_rdwr() to be used as a utility function inside of file systems
where MAC checks have already been performed, or where the operation
is being done on behalf of the kernel not the user.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI LAbs


# 92e35b60 07-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Due to layering problems, remove the MAC checks from vn_rdwr() -- this
VOP wrapper is called from within file systems so can result in odd
loopback effects when MAC enforcement is use with the active (as
opposed to saved) credential. These checks will be moved elsewhere.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# ee0812f3 01-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Since we have the struct file data pointer cached in vp, use that
instead when invoking VOP_POLL().


# 4a58340e 01-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control

Invoke appropriate MAC framework entry points to authorize a number
of vnode operations, including read, write, stat, poll. This permits
MAC policies to revoke access to files following label changes,
and to limit information spread about the file to user processes.

Note: currently the file cached credential is used for some of
these authorization check. We will need to expand some of the
MAC entry point APIs to permit multiple creds to be passed to
the access control check to allow diverse policy behavior.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 37bde6c0 01-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Restructure the vn_open_cred() access control checks to invoke
the MAC entry point for open authorization. Note that MAC can
reject open requests where existing DAC code skips the open
authorization check due to O_CREAT. However, the failure mode
here is the same as other failure modes following creation,
wherein an empty file may be left behind.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 4eee8de7 30-Jul-2002 Dag-Erling Smørgrav <des@FreeBSD.org>

Introduce struct xvnode, which will be used instead of struct vnode for
sysctl purposes. Also add two fields to struct vnode, v_cachedfs and
v_cachedid, which hold the vnode's device and file id and are filled in
by vn_open_cred() and vn_stat().

Sponsored by: DARPA, NAI Labs


# 0b1040cb 21-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Set VAPPEND in open mode when O_APPEND is specified as an argument to
open() of fhopen(). Currently this has no actual affect due to the
treatment of VAPPEND in vaccess() and vaccess_acl() as a subset of
VWRITE, but when MAC comes in, MAC will distinguish the two. Note:
if any file systems are cutting their own permission models, they
may wish to now take this into account.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# faab4e27 16-Jul-2002 Kirk McKusick <mckusick@FreeBSD.org>

Change the name of st_createtime to st_birthtime. This change is
made to reduce confusion between st_ctime and st_createtime.

Submitted by: Eric Allman <eric@sendmail.org>
Sponsored by: DARPA & NAI Labs.


# 7f05b035 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.


# 5c71bc6c 28-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Clean up vn_rdwr locking.

- Do shared locks on read.
- Only do vn_{start,finished}_write when writing.


# d374764f 24-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

Use proper size in bzero of stat structure.

Submitted by: Jake Burkholder <jake@locore.ca>
Sponsored by: DARPA & NAI Labs.


# 6524dddc 22-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This patch fixes a size problem with the stat structure for
64-bit architectures that was introduced in the UFS2 code
merge two days ago. The stat structure change that caused
the problem was the addition of the file create time.

Submitted by: Bruce Evans <bde@zeta.org.au>
Sponsored by: DARPA & NAI Labs.


# 1c85e6a3 21-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# 0e2d6cc8 14-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Disable the shared locking namei() code for now. It breaks several stacking
filesystems. This is on hold until the rest of VFS Locking is reviewed and
deemed safe. It can be enabled with 'options LOOKUP_SHARED'.


# ba626c1d 16-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Lock proctree_lock instead of pgrpsess_lock.


# 79a3e970 14-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Use VOP_GETVOBJECT instead of accessing the member directly. This fixed
an issue with nullfs and NAMEI shared.

Submitted by: Alexander Kabaev


# a59f8b9e 08-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Turn #ifdef LOOKUP_SHARED into #ifndef LOOKUP_EXCLUSIVE to enable this
behavior by default. Also, change the options line to reflect this.

If there are no problems reported this will become the only behavior and the
knob will be removed in a month or so.

Demanded by: obrien


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 237e41fc 25-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Added used include of <sys/sx.h>. Don't depend on namespace pollution in
<sys/file.h>.


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 628abf6c 15-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Giant pushdown for read/write/pread/pwrite syscalls.

kern/kern_descrip.c:
Aquire Giant in fdrop_locked when file refcount hits zero, this removes
the requirement for the caller to own Giant for the most part.

kern/kern_ktrace.c:
Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls.

kern/vfs_bio.c:
Aquire Giant in bwillwrite if needed.

kern/sys_generic.c
Giant pushdown, remove Giant for:
read, pread, write and pwrite.
readv and writev aren't done yet because of the possible malloc calls
for iov to uio processing.

kern/sys_socket.c
Grab giant in the socket fo_read/write functions.

kern/vfs_vnops.c
Grab giant in the vnode fo_read/write functions.


# 8de00f4a 11-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This patch adds the "LOCKSHARED" option to namei which causes it to only acquire shared locks on leafs.
The stat() and open() calls have been changed to make use of this new functionality. Using shared locks in
these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes. Also,
this reduces the number of exclusive locks that are floating around in the system, which helps reduce the
number of deadlocks that occur.

A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so this patch can be turned on for
testing, and should eventually go away once it is proven to be stable. I have personally been running this
patch for over a year now, so it is believed to be fully stable.

Reviewed by: jake, obrien
Approved by: jake


# 183ccde6 11-Mar-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Stop abusing the pgrpsess_lock.


# eb8e6d52 05-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

Document all functions, global and static variables, and sysctls.
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)

Reviewed by: phk (but all errors are mine)


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# f591779b 23-Feb-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# ec20f901 19-Feb-2002 Robert Watson <rwatson@FreeBSD.org>

More cleanups relating to vm object allocation failure: make sure we
call VOP_CLOSE() with vp unlocked; clean up the return path a little,
in as much as our namei/vnode operation return paths can be cleared
up. For a return case that was apparently never taken, this sure
is ugly.

Reviewed by: jeffr


# b01bcf4c 17-Feb-2002 Ian Dowse <iedowse@FreeBSD.org>

Add the braces missed by revision 1.131.

Pointy hat to: rwatson


# 4729fbd8 17-Feb-2002 Robert Watson <rwatson@FreeBSD.org>

When vn_open() is failing because it cannot allocate a vm object, call
VOP_CLOSE() on the vnode, so that VOP_OPEN() and VOP_CLOSE() calls
are symmetric in all failure cases. This prevents an 'open' reference
from being leaked in that unlikely failure scenario.


# 1ea030d8 10-Feb-2002 Robert Watson <rwatson@FreeBSD.org>

Make sure to hold vnode lock when calling into VOP_GETATTR().

Discussed with: mckusick, phk


# 74237f55 09-Feb-2002 Robert Watson <rwatson@FreeBSD.org>

Part I: Update extended attribute API and ABI:

o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
as not to use the scatter gather API (which appeared not to be used
by any consumers, and be less portable), rather, accepts 'data'
and 'nbytes' in the style of other simple read/write interfaces.
This changes the API and ABI.

o Modify system call semantics so that extattr_get_{fd,file}() return
a size_t. When performing a read, the number of bytes read will
be returned, unless the data pointer is NULL, in which case the
number of bytes of data are returned. This changes the API only.

o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
argument so as to return the size, if desirable. If set to NULL,
the size will not be returned.

o Update various filesystems (pseodofs, ufs) to DTRT.

These changes should make extended attributes more useful and more
portable. More commits to rebuild the system call files, as well
as update userland utilities to follow.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 3e728224 25-Jan-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make st_blksize default to PAGE_SIZE instead of zero.


# c73df808 18-Jan-2002 Matthew Dillon <dillon@FreeBSD.org>

Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal
operation. The vgonel() code has always called vclean() but until we
started proactively freeing vnodes it would never actually be called with
a dirty vnode, so this situation did not occur prior to the vnlru() code.
Now that we proactively free vnodes when kern.maxvnodes is hit, however,
vclean() winds up with work to do and improperly generates the warnings.

Reviewed by: peter
Approved by: re (for MFC)
MFC after: 1 day


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# fdb33f08 18-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

This is a forward port of Peter's vlrureclaim() fix, with some minor mods
by me to make it more efficient. The original code had serious balancing
problems and could also deadlock easily. This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes. This code still doesn't balance as well
as it could, but it does a much better job then the original code.

Approved by: re@freebsd.org
Obtained from: ps, peter, dillon
MFS Assuming: Assuming no problems crop up in Yahoo testing
MFC after: 7 days


# f03e89de 11-Nov-2001 Alfred Perlstein <alfred@FreeBSD.org>

turn vn_open() into a wrapper around vn_open_cred() which allows
one to perform a vn_open using temporary/other/fake credentials.

Modify the nfs client side locking code to use vn_open_cred() passing
proc0's ucred instead of the old way which was to temporary raise
privs while running vn_open(). This should close the race hopefully.


# fc2749a4 23-Oct-2001 Robert Watson <rwatson@FreeBSD.org>

o vn_open() fails to call VOP_CLOSE() if vfs_object_create fails. Ideally
all successful calls to VOP_OPEN() might be reflected in a call to
VOP_CLOSE(). For now, simply add a comment reflecting this problem;
this should be fixed at some point.


# 7106ca0d 11-Oct-2001 John Baldwin <jhb@FreeBSD.org>

Add missing includes of sys/lock.h.


# 3418ebeb 26-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

Make uio_yield() a global. Call uio_yield() between chunks
in vn_rdwr_inchunks(), allowing other processes to gain an exclusive
lock on the vnode. Specifically: directory scanning, to avoid a race to the
root directory, and multiple child processes coring simultaniously so they
can figure out that some other core'ing child has an exclusive adv lock and
just exit instead.

This completely fixes performance problems when large programs core. You
can have hundreds of copies (forked children) of the same binary core all
at once and not notice.

MFC after: 3 days


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 06ae1e91 08-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

This brings in a Yahoo coredump patch from Paul, with additional mods by
me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that
if you have large process images core dumping, or you have a large number of
forked processes all core dumping at the same time, the original coredump code
would leave the vnode locked throughout. This can cause the directory vnode
to get locked up, which can cause the parent directory vnode to get locked
up, and so on all the way to the root node, locking the entire machine up
for extremely long periods of time.

This patch solves the problem in two ways. First it uses an advisory
non-blocking lock to abort multiple processes trying to core to the same
file. Second (my contribution) it chunks up the writes and uses bwillwrite()
to avoid holding the vnode locked while blocking in the buffer cache.

Submitted by: ps
Reviewed by: dillon
MFC after: 2 weeks


# 5d97bedb 23-Aug-2001 Andrey A. Chernov <ache@FreeBSD.org>

vn_stat(): if va_size (u_quad_t) > OFF_MAX, return EOVERFLOW, don't copy it
blindly to st_size


# ac8f990b 24-May-2001 Matthew Dillon <dillon@FreeBSD.org>

This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by: tegge, dillon


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# 112f7372 25-Apr-2001 Kirk McKusick <mckusick@FreeBSD.org>

When closing the last reference to an unlinked file, it is freed
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.

For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.

Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 602ef631 25-Mar-2001 Boris Popov <bp@FreeBSD.org>

Previous commit broke interlock locking for !LK_RETRY case.


# 71d8277b 25-Mar-2001 Boris Popov <bp@FreeBSD.org>

Prevent race condition by using msleep() instead of mtx_unlock()/tsleep().

Reviewed by: alfred


# 30632071 18-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by: jkh
Obtained from: TrustedBSD Project


# 70f36851 14-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.

o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to advoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.

o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.

Obtained from: TrustedBSD Project


# 608a3ce6 15-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Extend kqueue down to the device layer.

Backwards compatible approach suggested by: peter


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 1b367556 23-Jan-2001 Jason Evans <jasone@FreeBSD.org>

Convert all simplelocks to mutexes and remove the simplelock implementations.


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 1d7e3e42 02-Nov-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Take VBLK devices further out of their missery.

This should fix the panic I introduced in my previous commit on this topic.


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# a18b1f1d 03-Oct-2000 Jason Evans <jasone@FreeBSD.org>

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


# 100d2c18 22-Sep-2000 Robert Watson <rwatson@FreeBSD.org>

o Introduce vn_extattr_rm(), a helper function in the style of
vn_extattr_get() and vn_extattr_set(). vn_extattr_rm() removes the
specified extended attribute from a vnode, authorizing the change as
the kernel (NULL cred).

Obtained from: TrustedBSD Project


# e81c5f43 04-Sep-2000 Robert Watson <rwatson@FreeBSD.org>

o vn_extattr_set() will now call appropriate vn_start_write() and
vn_finished_write() if IO_NODELOCKED is not set.

Obtained from: TrustedBSD Project


# e6a9ab52 08-Aug-2000 Robert Watson <rwatson@FreeBSD.org>

o Introduce vn_extattr_{get,set}, wrapper routines for VOP_GETEXTATTR
and VOP_SETEXTATTR to simplify calling from in-kernel consumers,
such as capability code. Both accept a vnode (optionally locked,
with ioflg to indicate that), attribute name, and a buffer + buffer
length in UIO_SYSSPACE. Both authorize the call as a kernel request,
with cred set to NULL for the actual VOP_ calls.

Obtained from: TrustedBSD Project


# 9b971133 23-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# e6796b67 03-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Move the truncation code out of vn_open and into the open system call
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.

Obtained from: BSD/OS


# cb5ad9d3 25-Jun-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Fix stupid braino in last commit, initialize `vp' before we test vp->v_tag.

Spotted by: dillon


# c8bea19e 22-Jun-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Add a hack to fail registration of kq events on a non-ufs filesystem, as
support for those is non-existent at the moment.


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 37d90a44 12-May-2000 Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>

Fix comment typo.

Submitted by: nrahlstr


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# 2c9b67a8 30-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unneeded #include <vm/vm_zone.h>

Generated by: src/tools/tools/kerninclude


# cb679c38 16-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce kqueue() and kevent(), a kernel event notification facility.


# e4649cfa 01-Apr-2000 Matthew Dillon <dillon@FreeBSD.org>

Change the write-behind code to take more care when starting
async I/O's. The sequential read heuristic has been extended to
cover writes as well. We continue to call cluster_write() normally,
thus blocks in the file will still be reallocated for large (but still
random) I/O's, but I/O will only be initiated for truely sequential
writes.

This solves a number of annoying situations, especially with DBM (hash
method) writes, and also has the side effect of fixing a number of
(stupid) benchmarks.

Reviewed-by: mckusick


# ba4ad1fc 09-Jan-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# bd5f5da9 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add bwillwrite to all system calls that create things in the filesystem.
Benchmarks that create huge trees of empty files overwhelm the buffer cache.


# 762e6b85 15-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Introduce NDFREE (and remove VOP_ABORTOP)


# 91921bd5 18-Nov-1999 Matthew Dillon <dillon@FreeBSD.org>

Ensure that garbage from the kernel stack does not wind up being
returned to user mode in the spare fields of the stat structure.

PR: kern/14966
Reviewed by: dillon@freebsd.org
Submitted by: Kelly Yancey kbyanc@posi.net


# b127fae4 07-Nov-1999 Peter Wemm <peter@FreeBSD.org>

Add a vnode fo_stat() entry point.


# 13ccadd4 19-Sep-1999 Brian Feldman <green@FreeBSD.org>

This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter


# 85a219d2 09-Sep-1999 Julian Elischer <julian@FreeBSD.org>

Changes to centralise the default blocksize behaviour.
More likely to follow.

Submitted by: phk@freebsd.org


# 7012bab9 02-Sep-1999 Julian Elischer <julian@FreeBSD.org>

Revert a bunch of contraversial changes by PHK. After
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.

The reversion requested was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.


# de5f40af 31-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Improve the returned values in st_blksize a little bit, avoid
accessing union fields not valid for dev_t type.


# 02e15769 30-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Make bdev userland access work like cdev userland access unless
the highly non-recommended option ALLOW_BDEV_ACCESS is used.

(bdev access is evil because you don't get write errors reported.)

Kill si_bsize_best before it kills Matt :-)

Use the specfs routines rather having cloned copies in devfs.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# b5fca1cb 27-Aug-1999 Brian Feldman <green@FreeBSD.org>

Add FIODTYPE ioctl for getting d_flags (type) info on a device.

Okayed by: phk


# a431597b 25-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Add a couple of missing but unimportant break; statements.


# 0232a251 13-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

oops: Add missing include.


# 3a965c0d 13-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Move the special-casing of stat(2)->st_blksize for device files
from UFS to the generic level. For chr/blk devices we don't care
about the blocksize of the filesystem, we want what the device
asked for.


# e32c66c5 04-Aug-1999 Brian Feldman <green@FreeBSD.org>

Fix fd race conditions (during shared fd table usage.) Badfileops is
now used in f_ops in place of NULL, and modifications to the files
are more carefully ordered. f_ops should also be set to &badfileops
upon "close" of a file.

This does not fix other problems mentioned in this PR than the first
one.

PR: 11629
Reviewed by: peter


# 67452993 26-Jul-1999 Alan Cox <alc@FreeBSD.org>

Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by: dillon


# ad8ac923 08-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations. I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

* buffer cache hash table now dynamically allocated. This will
have no effect on memory consumption for smaller systems and
will help scale the buffer cache for larger systems.

* minor enhancement to pmap_clearbit(). I noticed that
all the calls to it used constant arguments. Making
it an inline allows the constants to propogate to
deeper inlines and should produce better code.

* removal of inherent vfs_ioopt support through the emplacement
of appropriate #ifdef's, with John's permission. If we do not
find a use for it by the end of the year we will remove it entirely.

* removal of getnewbufloops* counters & sysctl's - no longer
necessary for debugging, getnewbuf() is now optimal.

* buffer hash table functions removed from sys/buf.h and localized
to vfs_bio.c

* VFS_BIO_NEED_DIRTYFLUSH flag and support code added
( bwillwrite() ), allowing processes to block when too many dirty
buffers are present in the system.

* removal of a softdep test in bdwrite() that is no longer necessary
now that bdwrite() no longer attempts to flush dirty buffers.

* slight optimization added to bqrelse() - there is no reason
to test for available buffer space on B_DELWRI buffers.

* addition of reverse-scanning code to vfs_bio_awrite().
vfs_bio_awrite() will attempt to locate clusterable areas
in both the forward and reverse direction relative to the
offset of the buffer passed to it. This will probably not
make much of a difference now, but I believe we will start
to rely on it heavily in the future if we decide to shift
some of the burden of the clustering closer to the actual
I/O initiation.

* Removal of the newbufcnt and lastnewbuf counters that Kirk
added. They do not fix any race conditions that haven't already
been fixed by the gbincore() test done after the only call
to getnewbuf(). getnewbuf() is a static, so there is no chance
of it being misused by other modules. ( Unless Kirk can think
of a specific thing that this code fixes. I went through it
very carefully and didn't see anything ).

* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not
think this check is necessary, the buffer should flush properly
whether the vnode is locked or not. ( yes? ).

* removal of extra arguments passed to getnewbuf() that are not
necessary.

* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
vfs_cluster.c

* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
which should greatly aid flushing operations in heavy load
situations - both the pageout and update daemons will be able
to operate more efficiently.

* removal of b_usecount. We may add it back in later but for now
it is useless. Prior implementations of the buffer cache never
had enough buffers for it to be useful, and current implementations
which make more buffers available might not benefit relative to
the amount of sophistication required to implement a b_usecount.
Straight LRU should work just as well, especially when most things
are VMIO backed. I expect that (even though John will not like
this assumption) directories will become VMIO backed some point soon.

Submitted by: Matthew Dillon <dillon@backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 8947a90a 02-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Make sure that stat(2) and friends always return a valid st_dev field.

Pseudo-FS need not fill in the va_fsid anymore, the syscall code
will use the first half of the fsid, which now looks like a udev_t
with major 255.


# 75c13541 28-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# f711d546 27-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


# f78fd73f 20-Apr-1999 Alan Cox <alc@FreeBSD.org>

Address several problems in vn_read and vn_write:

1. Make read-ahead work for pread and aio_read.

2. Fix one place where a comparison of uio_offset with -1
wasn't updated to use FOF_OFFSET.

3. Honor O_APPEND in the FOF_OFFSET case.

In addition, use the variable name "ioflag" in both vn_read and
vn_write to avoid possible confusion between the variable "flag"
and the parameter "flags".

Submitted by: Bruce Evans <bde@zeta.org.au> and me


# 8fe387ab 04-Apr-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Add standard padding argument to pread and pwrite syscall. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.


# cde9bc87 26-Mar-1999 Alan Cox <alc@FreeBSD.org>

Changed vn_read/write such that fp->f_offset isn't touched
if uio->uio_offset != -1. This fixes a problem with aio_read/write
and permits a straightforward implementation of pread/pwrite.

PR: kern/8669
Submitted by: John Plevyak <jplevyak@inktomi.com>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>


# 57c90d6f 29-Jan-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Use suser() to determine super-user-ness, don't examine cr_uid directly.


# 15a1057c 20-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Add 'options DEBUG_LOCKS', which stores extra information in struct
lock, and add some macros and function parameters to make sure that
the information get to the point where it can be put in the lock
structure.

While I'm here, add DEBUG_VFS_LOCKS to LINT.


# fb116777 05-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Remove the 'waslocked' parameter to vfs_object_create().


# f3d6ee09 01-Nov-1998 Peter Wemm <peter@FreeBSD.org>

Only do one VOP_ACCESS() per open() instead of two. This should reduce
the NFSv3 ACCESS RPC problems a little for busy clients that do a lot of
open/close. The nfs code could probably cache the results, but I'm not
sure whether this would be legal or useful. The problem is that with
a CPU farm, on each open there would be a lookup, getattr then access RPC
then the read/write RPC activity. Caching the access results probably
isn't going to help much if the clients access lots of files. Having the
nfs_access() routine interpret the getattr results is a bit of a hack, but
it's how NFSv2 is done and it might be OK for a mount attribute for v3.


# c259b8dd 27-Jun-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Report the mode as the result of the VOP_GETATTR rather than the
vnodes type, they may not correspond.


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 7be2d300 06-May-1998 Mike Smith <msmith@FreeBSD.org>

In the words of the submitter:

---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.

Quality testing was done with testvn, and lat_fs from the lmbench suite.

Some NFS client testing courtesy of Patrik Kudo.

vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------

Submitted by: Michael Hancock <michaelh@cet.co.jp>


# 7c2e3d32 09-Apr-1998 Alexander Langer <alex@FreeBSD.org>

Grammar police.


# 5ddc8ded 08-Apr-1998 Wolfram Schneider <wosch@FreeBSD.org>

New mount option nosymfollow. If enabled, the kernel lookup()
function will not follow symbolic links on the mounted
file system and return EACCES (Permission denied).


# 100ceca2 06-Apr-1998 Peter Wemm <peter@FreeBSD.org>

Today is not my lucky day. Fix missing brace and I got a request
to use EMLINK instead.


# 193afe01 06-Apr-1998 Peter Wemm <peter@FreeBSD.org>

Use a different errno (ELOOP (as sef mentioned) since the text that goes
with the error sounds ok for the condition) if O_NOFOLLOW gets a link.


# 0fdc628b 06-Apr-1998 Peter Wemm <peter@FreeBSD.org>

Rather than let users get fd's to symlink files, make O_NOFOLLOW cause
an error if it gets a link (like it does if it gets a socket). The
implications of letting users try and do file operations on symlinks
themselves were too worrying.


# 7e3426aa 06-Apr-1998 Peter Wemm <peter@FreeBSD.org>

Implement a new open(2) flag: O_NOFOLLOW. This will instruct open
to not follow symlinks, but to open a handle on the link itself(!).
As strange as this might sound, it has several useful applications
safe race-free ways of opening files in hostile areas (eg: /tmp, a mode
1777 /var/mail, etc). It also would allow things like fchown() to work
on the link rather than having to implement a new syscall specifically for
that task.

Reviewed by: phk


# 6b16931c 24-Feb-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 925a3a41 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 60f8d464 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem
with the object cache removal.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 2a024a2b 05-Dec-1997 Sean Eric Fagan <sef@FreeBSD.org>

Changes to allow event-based process monitoring and control.


# fd3bf775 28-Nov-1997 John Dyson <dyson@FreeBSD.org>

Fix and complete the AIO syscalls. There are some performance enhancements
coming up soon, but the code is functional. Docs will be forthcoming.


# 4a11ca4e 07-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# 32e4d4c5 27-Oct-1997 Bruce Evans <bde@FreeBSD.org>

Use 127 instead of CHAR_MAX for the limit on the sequence count. The
limit doesn't have anything to do with characters. The count mainly
needs to fit in the VOP_READ() ioflag after being left shifted by 16.

Moved vn_lock() before vn_closefile(). vn_lock() was mismerged from
Lite2.

Removed some gratuitous braces.


# e7b0208f 05-Oct-1997 John Dyson <dyson@FreeBSD.org>

Relax the vnode locking for read only operations.


# a2f9bc72 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

vn_select -> vn_poll


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 42146e37 04-Apr-1997 Doug Rabson <dfr@FreeBSD.org>

[Previous comment was incorrect for these files]
Added calls to VFS lock debugging macros to make fixing filesystems' locking
easier.


# de15ef6a 04-Apr-1997 Doug Rabson <dfr@FreeBSD.org>

Add a function vop_sharedlock which a copy of vop_nolock without the
implementation #ifdef out. This can be used for now by NFS. As soon
as all the other filesystems' locking is fixed, this can go away.

Print the vnode address in vprint for easier debugging.


# 20982410 24-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't include <sys/ioctl.h> in the kernel. Stage 4: include
<sys/ttycom.h> and sometimes <sys/filio.h> instead of <sys/ioctl.h>
in miscellaneous files. Most of these files have nothing to do
with ttys but need to include <sys/ttycom.h> to get the definitions
of TIOC[SG]PGRP which are (ab)used to convert F[SG]ETOWN fcntls into
ioctls.


# 3ac4d1ef 22-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined.
Fixed everything that depended on getting fcntl.h stuff from the wrong
place. Most things don't depend on file.h stuff at all.


# dfd0621a 08-Mar-1997 Guido van Rooij <guido@FreeBSD.org>

Fix style bugs and other bugs in the NFS fix.


# 324d42ad 07-Mar-1997 Gary Palmer <gpalmer@FreeBSD.org>

Fix (I hope) the NFS hole. This is only compile tested.

Submitted by: (partly) davids@SECNET.COM via BUGTRAQ


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 8b612c4b 28-Dec-1996 John Dyson <dyson@FreeBSD.org>

This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE: THAT LKMS MUST BE REBUILT!!!


# 6476c0d2 21-Aug-1996 John Dyson <dyson@FreeBSD.org>

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# 0f20dc94 08-Mar-1996 John Dyson <dyson@FreeBSD.org>

Remove a now unnecessary function prototype.


# 91477adc 01-Mar-1996 John Dyson <dyson@FreeBSD.org>

Enable VMIO for non-VDIR metadata and block device.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 27a0b398 17-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Staticize.
Unstaticize a function in scsi/scsi_base that was used, with an undocumented
option.
My last count on the LINT kernel shows:
Total symbols: 3647
unref symbols: 463
undef symbols: 4
1 ref symbols: 1751
2 ref symbols: 485
Approaching the pain threshold now.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# d68a4190 22-Oct-1995 David Greenman <dg@FreeBSD.org>

Moved the filesystem read-only check out of the syscalls and into the
filesystem layer, as was done in lite-2. Merged in some other cosmetic
changes while I was at it. Rewrote most of msdosfs_access() to be more
like ufs_access() and to include the FS read-only check.

Obtained from: partially from 4.4BSD-lite2


# e83e1865 06-Oct-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A little hack to avoid a 64bit divide. Can go away if Gcc ever learns to
optimise 64bit stuff...


# 24aa09cd 20-Jul-1995 David Greenman <dg@FreeBSD.org>

vnode_pager_alloc() never returns NULL, so don't check for it.


# 97e15667 16-Jul-1995 Bruce Evans <bde@FreeBSD.org>

Don't include <sys/tty.h> in drivers that aren't tty drivers or in general
files that don't depend on the internals of <sys/tty.h>


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 06cb7259 09-Jul-1995 David Greenman <dg@FreeBSD.org>

Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places
that call vnode_pager_alloc() so that a failure return can be dealt with.
This fixes a panic seen on NFS clients when a file being opened is deleted
on the server before the open completes.


# 6663c3d5 27-Jun-1995 David Greenman <dg@FreeBSD.org>

Removed extra semicolon.


# aa2cabb9 27-Jun-1995 David Greenman <dg@FreeBSD.org>

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 54ea0c00 10-May-1995 David Greenman <dg@FreeBSD.org>

Unlock the vnode before sleeping on an OBJ_DEAD object. Should fix Bruce's
hang. Fixed some formatting anomolies and removed some unneeded casts.


# 50475e8b 18-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed unnecessary call to vnode_pager_uncache(). We automatically clear
the VTEXT flag after all mappers have finished with the object.


# 9977034c 13-Feb-1995 Poul-Henning Kamp <phk@FreeBSD.org>

YFfix


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 8e58bf68 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# 797f2d22 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# 605f11c8 17-Aug-1994 David Greenman <dg@FreeBSD.org>

Moved over my fix for vnode lossage when multiple TIOCSCTTY ioctls are
done. This patch was extended to also include a suggested change by
Kirk McKusick which allows the control tty to be reasigned to a different
tty without losing a vnode.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources