History log of /freebsd-current/sys/kern/vfs_vnops.c
Revision	Date	Author	Comments
# 473c90ac	10-May-2024	John Baldwin <jhb@FreeBSD.org>	uio: Use switch statements when handling UIO_READ vs UIO_WRITE This is mostly to reduce the diff with CheriBSD which adds additional constants to enum uio_rw, but also matches the normal style used for uio_segflg. Reviewed by: kib, emaste Obtained from: CheriBSD Differential Revision: https://reviews.freebsd.org/D45142
# 08f3d5b6	04-Apr-2024	Mark Johnston <markj@FreeBSD.org>	copy_file_range: Call vn_rdwr() at least once This ensures that we invoke VOP_READ on the input file even if it's empty, which in turn helps ensure that filesystems update the atime of the file. PR: 274615 Reviewed by: olce, rmacklem, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D43524
# 89f1dcb3	14-Mar-2024	Rick Macklem <rmacklem@FreeBSD.org>	vfs_vnops.c: Use va_bytes >= va_size hint to avoid SEEK_DATA/SEEK_HOLE vn_generic_copy_file_range() tries to maintain holes in file ranges being copied, using SEEK_DATA/SEEK_HOLE where possible. Unfortunately SEEK_DATA/SEEK_HOLE operations can take a long time under certain circumstances. Although it is not currently possible to know if a file has unallocated data regions, the case where va_bytes >= va_size is a strong hint that there are no unallocated data regions. This hint does not work well for file systems doing compression, but since it is only a hint, it is still useful. For the case of va_bytes >= va_size, avoid doing SEEK_DATA/SEEK_HOLE. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D44509
# fa26f46d	23-Feb-2024	Jason A. Harmening <jah@FreeBSD.org>	vn_lock_pair(): allow lkflags1/lkflags2 to be 0 if vp1/vp2 is NULL It's a bit strange to require the caller to pass contrived lock flags if the corresponding vnode is NULL, simply to appease the assertion that exactly one of LK_SHARED or LK_EXCLUSIVE must be set. On the other hand, we still want to catch cases in which completely bogus or corrupt flags are passed even if the corresponding vnode is NULL. Therefore, specifically allow empty flags for lkflags1/lkflags2 iff the respective vp1/vp2 param is NULL. Reviewed by: kib, olce MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D44046
# 61cc4830	18-Jan-2024	Alfredo Mazzinghi <am2419@cl.cam.ac.uk>	Abstract UIO allocation and deallocation. Introduce the allocuio() and freeuio() functions to allocate and deallocate struct uio. This hides the actual allocator interface, so it is easier to modify the sub-allocation layout of struct uio and the corresponding iovec array. Obtained from: CheriBSD Reviewed by: kib, markj MFC after: 2 weeks Sponsored by: CHaOS, EPSRC grant EP/V000292/1 Differential Revision: https://reviews.freebsd.org/D43711
# f04220c1	19-Jan-2024	Konstantin Belousov <kib@FreeBSD.org>	kcmp(2): implement for vnode files Reviewed by: brooks, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D43518
# b068bb09	07-Jan-2024	Konstantin Belousov <kib@FreeBSD.org>	Add vnode_pager_clean_{a,}sync(9) Bump __FreeBSD_version for ZFS use. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D43356
# 2319ca6a	31-Dec-2023	Rick Macklem <rmacklem@FreeBSD.org>	vfs_vnops.c: Fix vn_generic_copy_file_range() for truncation When copy_file_range(2) was first being developed, inoffp + len had to be <= infile_size or an error was returned. This semantic (as defined by Linux) changed to allow inoffp + len to be greater than infile_size and the copy would end at inoffp + infile_size. Unfortunately, the code that decided if the outfd should be truncated in length did not get updated for this semantics change. As such, if a copy_file_range(2) is done, where infile_size - inoffp is less than outfile_size but len is large, the outfd file is truncated when it should not be. (The semantics for this for Linux is to not truncate outfd in this case.) This patch fixes the problem. I believe the calculation is safe for all non-negative values of outsize, outoffp, inoffp and insize, which should be ok, since they are all guaranteed to be non-negative. Note that this bug is not observed over NFSv4.2, since it truncates len to infile_size - *inoffp. PR: 276045 Reviewed by: asomers, kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D43258
# c5405d1c	18-Nov-2023	Konstantin Belousov <kib@FreeBSD.org>	vn_copy_file_range(): provide ENOSYS fallback to vn_generic_copy_file_range() Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D42603
# a9bc8637	18-Nov-2023	Konstantin Belousov <kib@FreeBSD.org>	vn_copy_file_range(): find write vnodes on which to call the VOP Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D42603
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixups to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 305a2676	19-Nov-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: dodge locking for lseek(fd, 0, SEEK_CUR) It is very common and according to dtrace while running poudriere almost all calls with SEEK_CUR pass 0.
# 22bac49b	16-Nov-2023	Konstantin Belousov <kib@FreeBSD.org>	vn_lock_pair(): reasonably handle vp1 == vp2 case Lock the vnode in the most exclusive lock mode requested, once. All callers already ensure that vp1 != vp2 or are careful enough to only unlock once otherwise. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D42642
# 23210f53	12-Nov-2023	Konstantin Belousov <kib@FreeBSD.org>	vn_copy_file_range(): busy both in and out mp around call to VOP_COPY_FILE_RANGE() This is required e.g. for nullfs to ensure liveness of the lower mount points. Reviewed by: jah, rmacklem, Olivier Certner <olce.freebsd@certner.fr> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D42554
# 89188bd6	12-Nov-2023	Konstantin Belousov <kib@FreeBSD.org>	vn_copy_file_range(): use local variables for invp/outvp vnodes v_mounts This avoids possible NULL dereference when checking mnt_vfc names. Reviewed by: jah, rmacklem, Olivier Certner <olce.freebsd@certner.fr> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D42554
# 969071be	06-Sep-2023	Martin Matuska <mm@FreeBSD.org>	vfs: copy_file_range() between multiple mountpoints of the same fs type VOP_COPY_FILE_RANGE(9) is now called when source and target vnodes reside on the same filesystem type (not just on the same mountpoint). The check if vnodes are on the same mountpoint must be done in the filesystem code. There are currently only three users - fusefs(5) already has this check, ZFS can handle multiple mountpoints and a check has been added to NFS client. ZFS block cloning is now possible between all snapshots and datasets of the same ZFS pool. MFC after: 1 week Reviewed by: rmacklem Differential Revision: https://reviews.freebsd.org/D41721
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# 821dec4d	05-Aug-2023	Konstantin Belousov <kib@FreeBSD.org>	vnode io: request range-locking when pgcache reads are enabled PR: 272678 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D41334
# 651fdc3d	05-Aug-2023	Konstantin Belousov <kib@FreeBSD.org>	Revert "vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault()" This reverts commit 5b353925ff61b9ddb97bb453ba75278b578ed7d9. The reason is the lesser scalability of the vnode's rangelock compared with the vnode lock. Requested by: mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D41334
# 5b353925	23-Jul-2023	Konstantin Belousov <kib@FreeBSD.org>	vnode read(2)/write(2): acquire rangelock regardless of do_vn_io_fault() To ensure atomicity of reads against parallel writes and truncates, vnode lock was not enough at least since introduction of vn_io_fault(). That code only takes the rangelock when it is possible that vn_read() and vn_write() could drop the vnode lock. At least since the introduction of VOP_READ_PGCACHE() which generally does not lock the vnode at all, rangelocks become required even for filesystems that do not need vn_io_fault() workaround. For instance, tmpfs. PR: 272678 Analyzed and reviewed by: Andrew Gierth <andrew@tao11.riddles.org.uk> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D41158
# e38c634b	19-Jul-2023	Dmitry Chagin <dchagin@FreeBSD.org>	vfs: Add parentheses to vn_lock_pair() asserts to silence gcc Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D41070
# f5837839	09-Jul-2023	Olivier Certner <olce.freebsd@certner.fr>	vn_lock_pair(): Support passing LK_NODDLKTREAT Since this function ultimately calls vn_lock() or VOP_LOCK1(), allow it to receive and pass this flag which is used in the lookup code and doesn't interfere with the function's operation. Reviewed by: kib, markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D40954
# 2544b8e0	28-Apr-2023	Olivier Certner <olce.freebsd@certner.fr>	vfs: Rename vfs_emptydir() to vn_dir_check_empty() No functional change. While here, adapt comments to style(9). Reviewed by: kib MFC after: 1 week
# c21d87a8	28-Apr-2023	Olivier Certner <olce.freebsd@certner.fr>	vfs: vn_dir_next_dirent(): Adapt comments to style(9) No functional change. Reviewed by: kib MFC after: 1 week
# 3d8450db	24-Apr-2023	Olivier Certner <olce.freebsd@certner.fr>	vfs: vn_dir_next_dirent(): Simplify interface and harden Simplify the old interface (one less argument, simpler termination test) and add documentation about it. Add more sanity checks (mostly under INVARIANTS, but also in the general case to prevent infinite loops). Drop the explicit test on minimum directory entry size (without INVARIANTS). Deal with the impacts in callers (dirent_exists() and vop_stdvptocnp()). dirent_exists() has been simplified a bit, preserving the exact same semantics but for the return code whose meaning has been reversed (0 now means the entry exists, ENOENT that it doesn't and other values are genuine errors). While here, suppress gratuitous casts of malloc return values. vn_dir_next_dirent() has been tested by a 'make -j4 buildkernel' with a temporary modification to the VFS cache causing vn_vptocnp() to always call VOP_VPTOCNP() and finally vop_stdvptocnp() (observed with temporary debug counters). Export new _GENERIC_MINDIRSIZ and _GENERIC_MAXDIRSIZ on __BSD_VISIBLE, and GENERIC_MINDIRSIZ and GENERIC_MAXDIRSIZ on _KERNEL. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D39764
# 6bce3f23	23-Apr-2023	Olivier Certner <olce.freebsd@certner.fr>	vfs: Export get_next_dirent() as vn_dir_next_dirent() Move internal-to-'vfs_default.c' get_next_dirent() to 'vfs_vnops.c' and export it for use by other parts of the VFS. This is a preparatory change for using it in vfs_emptydir(). No functional change. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D39755
# 37e6036b	24-Apr-2023	Konstantin Belousov <kib@FreeBSD.org>	VOP_CLOSE(): MNTK_EXTENDED_SHARED filesystems do not need excl lock All in-tree implementations of VOP_CLOSE() for filesystems proclaiming MNTK_EXTENDED_SHARED, are fine with the shared lock for the closed vnode. I checked the following implementations: ffs ext2 ufs null tmpfs devfs fdescfs cd9660 zfs It seems that initial addition of FWRITE check was due to necessity of handling the VV_TEXT vnode vflag. Since VOP_ADD_WRITECOUNT() only requires shared lock, we can relax the locking requirement there. Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr> Tested by: Olivier Certner Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D39784
# 6a5e6140	24-Apr-2023	Olivier Certner <olce.freebsd@certner.fr>	vn_open_vnode(): fix locking around VOP_CLOSE() on advisory lock error In the case of a FIFO or if trying to open a file for writing, an exclusive lock is necessary. Reviewed by: kib MFC after: 1 week
# afa8f897	05-Apr-2023	Konstantin Belousov <kib@FreeBSD.org>	vn_start_write(): consistently set mpp to NULL on error or after failed sleep This ensures that mpp != NULL iff vn_finished_write() should be called, regardless of the returned error, except for V_NOWAIT. The only exception that must be maintained is the case where vn_start_write(V_NOWAIT) is called with the intent of later dropping other locks and then doing vn_start_write(V_XSLEEP), which needs the mp value calculated from the non-waitable call above it. Also note that V_XSLEEP is not supported by vn_start_secondary_write(). Reviewed by: markj, mjg (previous version), rmacklem (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39441
# b2f32887	09-Apr-2023	Konstantin Belousov <kib@FreeBSD.org>	vn_start_write(): minor style Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39441
# bb24eaea	05-Apr-2023	Konstantin Belousov <kib@FreeBSD.org>	vn_lock_pair(): allow to request shared locking If either of vnodes is shared locked, lock must not be recursed. Requested by: rmacklem Reviewed by: markj, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39444
# 3b605620	03-Feb-2023	Konstantin Belousov <kib@FreeBSD.org>	FIOSEEKHOLE/FIOSEEKDATA: correct consistency for bmap-based implementation Writes on UFS through a mapped region do not allocate disk blocks in holes immediately. The blocks are allocated when the pages are paged out first time. This breaks the algorithm in vn_bmap_seekhole() and ufs_bmap_seekdata(), because VOP_BMAP() reports hole for the place which already contains a valid data. Clean the pages before doing VOP_BMAP() in the affected functions. In principle, we could clean less by only requesting clean starting from the offset, but it is probably not very important. PR: 269261 Reported by: asomers Reviewed by: asomers, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38379
# dec7db49	23-Jan-2023	Jiajie Chen <c@jia.je>	Add kf_file_nlink field to kf_file and populate it This will allow user-space programs (e.g. lsof) to locate deleted files whose nlink equals zero. Prior to this commit, programs had to use stat(kf_path) to get nlink, but that will fail if the file is deleted. [mjg: s/fail/file in the commit message] Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D38169
# 456f0575	19-Jan-2023	Konstantin Belousov <kib@FreeBSD.org>	Handle int rank issues in vn_getsize_locked() and vn_seek() In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into off_t size, we must avoid overflow. Then, the check for fsize < 0, introduced in the commit f45feecfb27ca51067d6789eaa43547cadc4990b 'vfs: add vn_getsize', is a nop [1]. Reported and reviewed by: jhb Coverity CID: 1502346 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38133
# f45feecf	22-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add vn_getsize getattr is very expensive and in important cases only gets called to get the size. This can be optimized with a dedicated routine which obtains that statistic. As a step towards that goal make size-only consumers use a dedicated routine. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D37885
# 4ee16246	16-Nov-2022	Rick Macklem <rmacklem@FreeBSD.org>	vfs_vnops.c: Fix blksize for ZFS Since ZFS reports _PC_MIN_HOLE_SIZE as 512 (although it appears that an unwritten region must be at least f_iosize to remain unallocated), vn_generic_copy_file_range() uses 4096 for the copy blksize for ZFS, resulting in slow copies. For most other file systems, _PC_MIN_HOLE_SIZE and f_iosize are the same value, so this patch modifies the code to use f_iosize for most cases. It also documents in comments why the blksize is being set a certain way, so that the code does not appear to be doing "magic math". Reported by: allanjude Reviewed by: allanjude, asomers MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D37076
# 33ce1788	17-Oct-2022	Konstantin Belousov <kib@FreeBSD.org>	vn_bmap_seekhole: check that passed offset is non-negative Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D37024
# 52360ca3	25-Sep-2022	Alan Somers <asomers@FreeBSD.org>	copy_file_range: truncate write if it would exceed RLIMIT_FSIZE PR: 266611 MFC after: 2 weeks Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D36706
# 1b4b7517	18-Sep-2022	Konstantin Belousov <kib@FreeBSD.org>	Add vn_rlimit_fsizex() and vn_rlimit_fsizex_res() The vn_rlimit_fsizex() function: - checks that the write does not exceed RLIMIT_FSIZE limit and fs maximum supported file size - truncates write length if it exceeds the RLIMIT_FSIZE or max file size, but there are some bytes to write - sends SIGXFSZ if RLIMIT_FSIZE would be exceeded otherwise. POSIX mandates the truncated write in case when some bytes can be written but whole write request fails the RLIMIT_FSIZE check. The function is supposed to be used from VOP_WRITE()s. Due to a peculiarity in the VFS generic write syscall layer, uio_resid must correctly reflect the written amount (noted by markj). Provide the dual vn_rlimit_fsizex_res() function to correct uio_resid after the clamp done in vn_rlimit_fsizex() on VOP_WRITE() return. PR: 164793 Reviewed by: asomers, jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36625
# 2ac083f6	18-Sep-2022	Konstantin Belousov <kib@FreeBSD.org>	Add vn_rlimit_trunc() Reviewed by: asomers, jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36625
# cc65a412	18-Sep-2022	Konstantin Belousov <kib@FreeBSD.org>	filesystems: return error from vn_rlimit_fsize() instead of EFBIG Reviewed by: asomers, jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36625
# a75d1ddd	17-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: introduce V_PCATCH to stop abusing PCATCH
# 9e4f35ac	17-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: deperl msleep flag calculation in vn_start_*write
# a755fb92	10-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire the V_MNTREF flag Reviewed by: kib, mckusick Differential Revision: https://reviews.freebsd.org/D36521
# eb9a1f9c	16-Aug-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: plug dead store in vn_io_fault1 Reported by: clang --analyze
# 8c9aa94b	23-Jul-2022	Ka Ho Ng <khng@FreeBSD.org>	Convert runtime param checks to KASSERTs for fo_fspacectl Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D35880
# b7262756	02-Apr-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fixup WANTIOCTLCAPS on open In some cases vn_open_cred overwrites cn_flags, effectively nullifying initialisation done in NDINIT. This will have to be fixed. In the meantime make sure the flag is passed. Reported by: jenkins Noted by: Mathieu <sigsys@gmail.com>
# bb92cd7b	24-Mar-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)
# d28af1ab	15-Nov-2021	Mark Johnston <markj@FreeBSD.org>	vm: Add a mode to vm_object_page_remove() which skips invalid pages This will be used to break a deadlock in ZFS between the per-mountpoint teardown lock and page busy locks. In particular, when purging data from the page cache during dataset rollback, we want to avoid blocking on the busy state of invalid pages since the busying thread may be blocked on the teardown lock in zfs_getpages(). Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump __FreeBSD_version so that the OpenZFS port can make use of the new helper. PR: 258208 Reviewed by: avg, kib, sef Tested by: pho (part of a larger patch) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32931
# f0c9847a	06-Nov-2021	Rick Macklem <rmacklem@FreeBSD.org>	vfs: Add "ioflag" and "cred" arguments to VOP_ALLOCATE When the NFSv4.2 server does a VOP_ALLOCATE(), it needs the operation to be done for the RPC's credential and not td_ucred. It also needs the writing to be done synchronously. This patch adds "ioflag" and "cred" arguments to VOP_ALLOCATE() and modifies vop_stdallocate() to use these arguments. The VOP_ALLOCATE.9 man page will be patched separately. Reviewed by: khng, kib Differential Revision: https://reviews.freebsd.org/D32865
# 2b68eb8e	01-Oct-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove thread argument from VOP_STAT and fo_stat.
# b4a58fbf	01-Oct-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove cn_thread It is always curthread. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D32453
# d71e1a88	25-Sep-2021	Mateusz Guzik <mjg@FreeBSD.org>	fifo: support flock This evens it up with Linux. Original patch by: Greg V <greg@unrelenting.technology> Differential Revision: https://reviews.freebsd.org/D24255#565302
# 2bd98269	16-Sep-2021	Mark Johnston <markj@FreeBSD.org>	vfs: Permit unix sockets to be opened with O_PATH As with FIFOs, a path descriptor for a unix socket cannot be used with kevent(). In principle connectat(2) and bindat(2) could be modified to support an AT_EMPTY_PATH-like mode which operates on the socket referenced by an O_PATH fd referencing a unix socket. That would eliminate the path length limit imposed by sockaddr_un. Update O_PATH tests. Reviewed by: kib MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31970
# c5128c48	07-Sep-2021	Rick Macklem <rmacklem@FreeBSD.org>	VOP_COPY_FILE_RANGE: Add a COPY_FILE_RANGE_TIMEO1SEC flag Although it is not specified in the RFCs, the concept that the NFSv4 server should reply to an RPC request within a reasonable time is accepted practice within the NFSv4 community. Without this patch, the NFSv4.2 server attempts to reply to a Copy operation within 1 second by limiting the copy to vfs.nfs.maxcopyrange bytes (default 10Mbytes). This is crude at best, given the large variation in I/O subsystem performance. This patch adds a kernel only flag COPY_FILE_RANGE_TIMEO1SEC that the NFSv4.2 server can specify, which tells VOP_COPY_FILE_RANGE() to return after approximately 1 second with a partial result and implements this in vn_generic_copy_file_range(), used by vop_stdcopyfilerange(). Modifying the NFSv4.2 server to set this flag will be done in a separate patch. Also under consideration is exposing the COPY_FILE_RANGE_TIMEO1SEC to userland for use on the FreeBSD copy_file_range(2) syscall. MFC after: 2 weeks Reviewed by: khng Differential Revision: https://reviews.freebsd.org/D31829
# 92bb74fd	01-Sep-2021	Ka Ho Ng <khng@FreeBSD.org>	vfs: Use file_cred for VOP_DEALLOCATE in vn_deallocate if non-NULL This changes vn_deallocate() to match the behavior of vn_rdwr() when picking which ucred to use. That is, vn_deallocate() uses file_cred for making VOP call if it is non-NULL, or use active_cred otherwise. Sponsored by: The FreeBSD Foundation Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31712
# a58e222b	29-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	vfs: yield in vn_deallocate_impl() loop Yield at the end of each loop iteration if there is remaining work as indicated by the value of *len updated by VOP_DEALLOCATE. Without this, when calling vop_stddeallocate to zero a large region, the implementation only zerofills a relatively small chunk and returns. Sponsored by: The FreeBSD Foundation Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31705
# a638dc4e	12-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	vfs: Add ioflag to VOP_DEALLOCATE(9) The addition of ioflag allows callers passing IO_SYNC/IO_DATASYNC/IO_DIRECT down to the file system implementation. The vop_stddeallocate fallback implementation is updated to pass the ioflag to the file system implementation. vn_deallocate(9) internally is also changed to pass ioflag to the VOP_DEALLOCATE call. Sponsored by: The FreeBSD Foundation Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D31500
# c15384f8	12-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	vfs: Add get_write_ioflag helper to calculate ioflag Converted vn_write to use this helper. Sponsored by: The FreeBSD Foundation MFC after: 3 days Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31513
# 4a9b832a	11-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	vfs: Rename ioflg to ioflag in vn_deallocate This includes a style fix around ioflag checking as well. Sponsored by: The FreeBSD Foundation Reviewed by: kib, bcr Differential Revision: https://reviews.freebsd.org/D31505
# c18c74a8	06-Aug-2021	Rick Macklem <rmacklem@FreeBSD.org>	namei: Add cn_flags bits for OPENREAD and OPENWRITE VOP_LOOKUP() is called with cn_flags bits ISLASTCN and ISOPEN to indicate that the lookup is for the last component of a pathname when doing open. If the cn_flags also indicates if the open is for Reading, Writing or Both, the NFSv4 client can do an NFSv4 Open operation in the same compound RPC as Lookup, often avoiding the additional Open RPC now done when VOP_OPEN() is called. This patch defines two new cn_flags bits called OPENREAD and OPENWRITE and sets these in open2nameif() based on FREAD, FWRITE flag bits. This will allow a subsequent patch to the NFSv4 client to do the Open operation in the same RPC as Lookup. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31431
# 0dc332bf	05-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9). fspacectl(2) is a system call to provide space management support to userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the deallocation. vn_deallocate(9) is a public KPI for kmods' use. The purpose of proposing a new system call, a KPI and a VOP call is to allow bhyve or other hypervisor monitors to emulate the behavior of SCSI UNMAP/NVMe DEALLOCATE on a plain file. fspacectl(2) comprises cmd and flags parameters to specify the space management operation to be performed. Currently cmd has to be SPACECTL_DEALLOC, and flags has to be 0. fo_fspacectl is added to fileops. VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation of VOP_DEALLOCATE(9) is provided. Sponsored by: The FreeBSD Foundation Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28347
# abbb57d5	04-Aug-2021	Ka Ho Ng <khng@FreeBSD.org>	vfs: Introduce vn_bmap_seekhole_locked() vn_bmap_seekhole_locked() is factored out version of vn_bmap_seekhole(). This variant requires shared vnode lock being held around the call. Sponsored by: The FreeBSD Foundation Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D31404
# 0ef5eee9	03-Aug-2021	Konstantin Belousov <kib@FreeBSD.org>	Add vn_lktype_write() and remove repetitive code that calculates vnode locking type for write. Reviewed by: khng, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31405
# a19ae1b0	02-Jun-2021	Rich Ercolani <rincebrain@gmail.com>	vfs: fix MNT_SYNCHRONOUS check in vn_write ca1ce50b2b5ef11d ("vfs: add more safety against concurrent forced unmount to vn_write") has a side effect of only checking MNT_SYNCHRONOUS if O_FSYNC is set. Reviewed By: mjg Differential Revision: https://reviews.freebsd.org/D30610
# 478c52f1	29-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: slightly rework vn_rlimit_fsize
# e71d5c73	22-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	Fix limit testing after 1762f674ccb571e6 ktrace commit. The previous: if ((uoff_t)uio->uio_offset + uio->uio_resid > lim) signal(....); was replaced with: if ((uoff_t)uio->uio_offset + uio->uio_resid < lim) return; signal(....); Making (uoff_t)uio->uio_offset + uio->uio_resid == lim trip over the limit, when it did not previously. Unbreaks running 13.0 buildworld.
# 48235c37	22-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	Fix a braino in previous. Instead of trying to partially ifdef out ktrace handling, define the missing identifier to 0. Without this fix lack of ktrace in the kernel also means there is no SIGXFSZ signal delivery.
# 154f0ecc	22-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	Fix tinderbox build after 1762f674ccb571e6 ktrace commit.
# ea2b64c2	18-May-2021	Konstantin Belousov <kib@FreeBSD.org>	ktrace: add a kern.ktrace.filesize_limit_signal knob When enabled, writes to ktrace.out that exceed the max file size limit cause SIGXFSZ as it should be, but note that the limit is taken from the process that initiated ktrace. When disabled, write is blocked, but signal is not sent. Note that in either case ktrace for the affected process is stopped. Requested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257
# 02645b88	14-May-2021	Konstantin Belousov <kib@FreeBSD.org>	ktrace: use the limit of the trace initiator for file size limit on writes Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257
# 9bb84c23	13-May-2021	Konstantin Belousov <kib@FreeBSD.org>	accounting: explicitly mark the exiting thread as doing accounting and use the mark to stop applying file size limits on the write of the accounting record. This allows to remove hack to clear process limits in acct_process(), and avoids the bug with the clearing being ineffective because limits are also cached in the thread structure. Reported and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30257
# ca1ce50b	14-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add more safety against concurrent forced unmount to vn_write 1. stop re-reading ->v_mount (can become NULL) 2. stop re-reading ->v_type (can change to VBAD)
# 6de3cf14	12-May-2021	Konstantin Belousov <kib@FreeBSD.org>	vn_open_cred(): disallow O_CREAT | O_EMPTY_PATH This combination does not make sense, and cannot be satisfied by lookup. In particular, lookup cannot supply dvp, it only can directly return vp. Reported and reviewed by: markj using syzkaller Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 5e7cdf18	06-May-2021	Konstantin Belousov <kib@FreeBSD.org>	openat(2): add O_EMPTY_PATH It reopens the passed file descriptor, checking the file backing vnode's current access rights against open mode. In particular, this flag allows converting a file descriptor opened with O_PATH into an operable file descriptor, assuming permissions allow that. Reviewed by: markj Tested by: Andrew Walker <awalker@ixsystems.com> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30148
# 4f592683	02-May-2021	Rick Macklem <rmacklem@FreeBSD.org>	copy_file_range(2): improve copying of a large hole to EOF PR#255523 reported that a file copy for a file with a large hole to EOF on ZFS ran slowly over NFSv4.2. The problem was that vn_generic_copy_file_range() would loop around reading the hole's data and then see it is all 0s. It was coded this way since UFS always allocates a data block near the end of the file, such that a hole to EOF never exists. This patch modifies vn_generic_copy_file_range() to check for a ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that case as a hole to EOF. asomers@ confirms that it works for his ZFS test case. PR: 255523 Tested by: asomers Reviewed by: asomers MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30076
# 20825657	28-Apr-2021	Konstantin Belousov <kib@FreeBSD.org>	O_PATH: disable kqfilter for fifos Filter on fifos is real filter for the object, and not a filesystem events filter like EVFILT_VNODE. Reported by: markj using syzkaller Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 54f98c4d	19-Apr-2021	Konstantin Belousov <kib@FreeBSD.org>	vn_open_vnode(): handle error when fp == NULL If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to be called, we cannot use fp fo_close() when there is no fp. This occurs when e.g. kernel code directly calls vn_open() instead of the open(2) syscall. In this case, VOP_CLOSE() can be called directly, after possible lock upgrade. Reported by: nvass@gmx.com PR: 255119 Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29830
# bbf7a4e8	07-Apr-2021	Konstantin Belousov <kib@FreeBSD.org>	O_PATH: allow vnode kevent filter on such files if VREAD access is checked as allowed during open Requested by: wulf Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323
# f9b923af	06-Apr-2021	Konstantin Belousov <kib@FreeBSD.org>	O_PATH: Allow to open symlink When O_NOFOLLOW is specified, namei() returns the symlink itself. In this case, open(O_PATH) should be allowed, to denote the location of symlink itself. Prevent O_EXEC in this case, execve(2) code is not ready to try to execute symlinks. Reported by: wulf Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323
# 8d9ed174	17-Mar-2021	Konstantin Belousov <kib@FreeBSD.org>	open(2): Implement O_PATH Reviewed by: markj Tested by: pho Discussed with: walker.aj325_gmail.com, wulf Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323
# 437c241d	17-Mar-2021	Konstantin Belousov <kib@FreeBSD.org>	vfs_vnops.c: Make vn_statfile() non-static Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323
# 42be0a7b	17-Mar-2021	Konstantin Belousov <kib@FreeBSD.org>	Style. Add missed spaces, wrap long lines. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D29323
# 20e91ca3	15-Feb-2021	Konstantin Belousov <kib@FreeBSD.org>	open(2): Remove O_BENEATH and AT_BENEATH with the reasoning that the flags did not work properly, and were not shipped in a release. O_RESOLVE_BENEATH is kept as useful. Reviewed by: markj Tested by: arichardson, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D28907
# a5f9fe2b	01-Mar-2021	Rick Macklem <rmacklem@FreeBSD.org>	copy_file_range(2): Fix for small values of input file offset and len r366302 broke copy_file_range(2) for small values of input file offset and len. It was possible for rem to be greater than len and then "len - rem" was a large value, since both variables are unsigned. Reported by: koobs, Pablo <pablogsal gmail com> (Python) Reviewed by: asomers, koobs MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28981
# fa3bd463	29-Jan-2021	Konstantin Belousov <kib@FreeBSD.org>	lockf: ensure atomicity of lockf for open(O_CREAT|O_EXCL|O_EXLOCK) or EX_SHLOCK. Do it by setting a vnode iflag indicating that the locking exclusive open is in progress, and not allowing F_LOCK request to make a progress until the first open finishes. Requested by: mckusick Reviewed by: markj, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28697
# c61fae14	14-Feb-2021	Konstantin Belousov <kib@FreeBSD.org>	pgcache read: protect against reads past end of the vm object size If uio_offset is past end of the object size, calculated resid is negative. Delegate handling this case to the locked read, as any other non-trivial situation. PR: 253158 Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de> Tested by: cy Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 3b2aa360	28-Jan-2021	Konstantin Belousov <kib@FreeBSD.org>	Use VOP_VPUT_PAIR() for eligible VFS syscalls. The current list is limited to the cases where UFS needs to handle vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(), VOP_LINK(), and VOP_SYMLINK(). Reviewed by: chs, mkcusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# ee965dfa	03-Feb-2021	Konstantin Belousov <kib@FreeBSD.org>	vn_open(): If the vnode is reclaimed during open(2), do not return error. Most future operations on the returned file descriptor will fail anyway, and application should be ready to handle those failures. Not forcing it to understand the transient failure mode on open, which is implementation-specific, should make us less special without loss of reporting of errors. Suggested by: chs Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# ef23df13	13-Jan-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: set NC_KEEPPOSENTRY alongside NOCACHE when creating a file Arguably the entire NOCACHE logic should get retired, in the meantime at least prevent the code from evicting existing entries.
# a5e28403	07-Jan-2021	Thomas Munro <tmunro@FreeBSD.org>	open(2): Add O_DSYNC flag. POSIX O_DSYNC means that writes include an implicit fdatasync(2), just as O_SYNC implies fsync(2). VOP_WRITE() functions that understand the new IO_DATASYNC flag can act accordingly, but we'll still pass down IO_SYNC so that file systems that don't understand it will continue to provide the stronger O_SYNC behaviour. Flag also applies to fcntl(2). Reviewed by: kib, delphij Differential Revision: https://reviews.freebsd.org/D25090
# 11403bde	06-Jan-2021	Chuck Silvers <chs@FreeBSD.org>	vfs: fix rangelock range in vn_rdwr() for IO_APPEND vn_rdwr() must lock the entire file range for IO_APPEND just like vn_io_fault() does for O_APPEND. Reviewed by: kib, imp, mckusick Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D28008
# 3e506a67	27-Dec-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add v_irflag accessors Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27793
# 99c66d3a	26-Nov-2020	Konstantin Belousov <kib@FreeBSD.org>	vn_read_from_obj(): fix handling of doomed vnodes. There is no reason why vp->v_object cannot be NULL. If it is, it's fine, handle it by delegating to VOP_READ(). Tested by: pho Reviewed by: markj, mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27327
# 3b1f974b	26-Nov-2020	Konstantin Belousov <kib@FreeBSD.org>	Make max ticks for pause in vn_lock_pair() adjustable at runtime. Reduce default value from hz / 10 to hz / 100. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation
# e75f0f2b	20-Nov-2020	Kirk McKusick <mckusick@FreeBSD.org>	Only attempt a VOP_UNLOCK() when the vn_lock() has been successful. No MFC as this code is not present in 12-stable. Reported by: Peter Holm Reviewed by: Mateusz Guzik Tested by: Peter Holm Sponsored by: Netflix
# 7cde2ec4	13-Nov-2020	Konstantin Belousov <kib@FreeBSD.org>	Implement vn_lock_pair(). In collaboration with: pho Reviewed by: mckusick (previous version), markj (previous version) Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136
# f6dd1aef	09-Nov-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: group mount per-cpu vars into one struct While here move frequently read stuff into the same cacheline. This shrinks struct mount by 64 bytes. Tested by: pho
# eebc2e45	28-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add NDREINIT to facilitate repeated namei calls struct nameidata mixes caller arguments, internal state and output, which can be quite error prone. Recent addition of validating ni_resflags uncovered a caller which could repeatedly call namei, effectively operating on partially populated state. Add bare minimum validation that this does not happen. The real fix would decouple aforementioned state. Reported by: pho Tested by: pho (different variant)
# deb1339f	09-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix a panic when truncating coming from copy_file_range Truncating requires an exclusive lock, but it was not taken if the filesystem indicates support for shared writes. This only concerns ZFS. In particular fixes cp of files which have trailing holes. Reported by: bdrewery
# 19fe23fa	08-Oct-2020	Rick Macklem <rmacklem@FreeBSD.org>	Make vn_generic_copy_file_range() interruptible via a signal. Without this patch, when vn_generic_copy_file_range() is doing a large copy, it will remain in the function for a considerable amount of time, delaying handling of any outstanding signals until the copy completes. This patch adds checks for signals that need to be processed after each successful data copy cycle. When sig_intr() returns non-zero, vn_generic_copy_file_range() will return. The check "if (len < savlen)" ensures that some data has been copied, so that progress will be made. Note that, since copy_file_range(2) is allowed to return fewer bytes copied than requested, it will never return EINTR/ERESTART when sig_intr() returns non-zero. Reviewed by: kib, asomers Differential Revision: https://reviews.freebsd.org/D26620
# 961afe3c	30-Sep-2020	Rick Macklem <rmacklem@FreeBSD.org>	Clip the "len" argument to vn_generic_copy_file_range() at a hole size boundary. By clipping the len argument of vn_generic_copy_file_range() to end at an exact multiple of hole size, holes are more likely to be maintained during the copy. A hole can still straddle the boundary at the end of the copy range, resulting in a block being allocated in the output file as it is being grown in size, but this will reduce the likelihood of this happening. While here, also modify setting of blksize to better handle the case where _PC_MIN_HOLE_SIZE is returned as 1. Reviewed by: asomers Differential Revision: https://reviews.freebsd.org/D26570
# 164aa1e9	29-Sep-2020	Rick Macklem <rmacklem@FreeBSD.org>	Make copy_file_range(2) Linux compatible for overflow of offset + len. Without this patch, if a call to copy_file_range(2) specifies an input file offset + len that would wrap around, EINVAL is returned. I thought that was the Linux behaviour, but recent testing showed that Linux accepts this case and does the copy_file_range() to EOF. This patch changes the FreeBSD code to exhibit the same behaviour as Linux for this case. Reviewed by: asomers, kib Differential Revision: https://reviews.freebsd.org/D26569
# 1317da43	22-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	Add O_RESOLVE_BENEATH and AT_RESOLVE_BENEATH to mimic Linux's RESOLVE_BENEATH. It is like O_BENEATH, but disallows walking out of the subtree rooted in the starting directory. O_BENEATH does not care if the path walks out, as long as it returns. Requested by: Dan Gohman <dev@sunfishcode.online> PR: 248335 Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886
# 4a0b316d	22-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	Add open2nameif() the helper to calculate namei flags both for open(2) and creat(2). Suggested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886
# f9cc8410	18-Sep-2020	Eric van Gyzen <vangyzen@FreeBSD.org>	vm_ooffset_t is now unsigned vm_ooffset_t is now unsigned. Remove some tests for negative values, or make other adjustments accordingly. Reported by: Coverity Reviewed by: kib markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26214
# 3c484f32	15-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	Convert page cache read to VOP. There are several negative side-effects of not calling into VOP layer at all for page cache reads. The biggest is the missed activation of EVFILT_READ knotes. Also, it allows filesystem to make more fine grained decision to refuse read from page cache. Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for asserts. Reviewed by: markj Tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346
# 88863665	15-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	vfs_subr.c: export io_hold_cnt and vn_read_from_obj(). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346
# feabaaf9	24-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	cache: drop the always curthread argument from reverse lookup routines Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs. Tested by: pho
# beb27033	16-Aug-2020	Konstantin Belousov <kib@FreeBSD.org>	Fix powerpc build. Sponsored by: The FreeBSD Foundation
# fbca789f	16-Aug-2020	Konstantin Belousov <kib@FreeBSD.org>	VMIO read If possible, i.e. if the requested range is resident valid in the vm object queue, and some secondary conditions hold, copy data for read(2) directly from the valid cached pages, avoiding vnode lock and instantiating buffers. I intentionally do not start read-ahead, nor handle the advises on the cached range. Filesystems indicate support for VMIO reads by setting VIRF_PGREAD flag, which must not be cleared until vnode reclamation. Currently only filesystems that use vnode pager for v_objects can enable it, due to reliance on vnp_size. There is a WIP to handle it for tmpfs. Reviewed by: markj Discussed with: jeff Tested by: pho Benchmarked by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968
# 51ea7bea	07-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add VOP_STAT The current scheme of calling VOP_GETATTR adds avoidable overhead. An example with tmpfs doing fstat (ops/s): before: 7488958 after: 7913833 Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D25910
# 3ea3fbe6	16-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix vn_poll performance with either MAC or AUDIT The code would unconditionally lock the vnode to audit or call the mac hook, even if neither want to do anything. Pre-check the state to avoid locking in the common case of nothing to do. Note this code should not be normally executed anyway as vnodes always return ready. However, poll1/2 from will-it-scale use regular files for benchmarking, presumably to focus on the interface itself as the vnode handler is not supposed to do almost anything. This in particular fixes poll2 which passes 128 fds. $ ./poll2_processes -s 10 before: 134411 after: 271572
# ab06a305	16-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix MAC/AUDIT mismatch in vn_poll Auditing would not be performed without MAC compiled in.
# 422f38d8	10-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix trivial whitespace issues which don't interfere with blame .. even without the -w switch
# 4543c1c3	05-Jul-2020	Konstantin Belousov <kib@FreeBSD.org>	Fix typo. Sponsored by: The FreeBSD Foundation MFC after: 3 days
# f2706588	21-Jun-2020	Thomas Munro <tmunro@FreeBSD.org>	vfs: track sequential reads and writes separately For software like PostgreSQL and SQLite that sometimes reads sequentially while also writing sequentially some distance behind with interleaved syscalls on the same fd, performance is better on UFS if we do sequential access heuristics separately for reads and writes. Patch originally by Andrew Gierth in 2008, updated and proposed by me with his permission. Reviewed by: mjg, kib, tmunro Approved by: mjg (mentor) Obtained from: Andrew Gierth <andrew@tao11.riddles.org.uk> Differential Revision: https://reviews.freebsd.org/D25024
# 63619b6d	04-Jun-2020	Kyle Evans <kevans@FreeBSD.org>	vfs: add restrictions to read(2) of a directory [2/2] This commit adds the priv(9) that waters down the sysctl to make it only allow read(2) of a dirfd by the system root. Jailed root is not allowed, but jail policy and superuser policy will abstain from allowing/denying it so that a MAC module can fully control the policy. Such a MAC module has been written, and can be found at: https://people.freebsd.org/~kevans/mac_read_dir-0.1.0.tar.gz It is expected that the MAC module won't be needed by many, as most only need to do such diagnostics that require this behavior as system root anyways. Interested parties are welcome to grab the MAC module above and create a port or locally integrate it, and with enough support it could see introduction to base. As noted in mac_read_dir.c, it is released under the BSD 2 clause license and allows the restrictions to be lifted for only jailed root or for all unprivileged users. PR: 246412 Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous) Reviewed by: rgrimes (latest version) Differential Revision: https://reviews.freebsd.org/D24596
# dcef4f65	04-Jun-2020	Kyle Evans <kevans@FreeBSD.org>	vfs: add restrictions to read(2) of a directory [1/2] Historically, we've allowed read() of a directory and some filesystems will accommodate (e.g. ufs/ffs, msdosfs). From the history department staffed by Warner: <<EOF pdp-7 unix seemed to allow reading directories, but they were weird, special things there so I'm unsure (my pdp-7 assembler sucks). 1st Edition's sources are lost, mostly. The kernel allows it. The reconstructed sources from 2nd or 3rd edition read it though. V6 to V7 changed the filesystem format, and should have been a warning, but reading directories weren't materially changed. 4.1b BSD introduced readdir because of UFS. UFS broke all directory reading programs in 1983. ls, du, find, etc all had to be rewritten. readdir() and friends were introduced here. SysVr3 picked up readdir() in 1987 for the AT&T fork of Unix. SysVr4 updated all the directory reading programs in 1988 because different filesystem types were introduced. In the 90s, these interfaces became completely ubiquitous as PDP-11s running V7 faded from view and all the folks that initially started on V7 upgraded to SysV. Linux never supported this (though I've not done the software archeology to check) because it has always had a pathological diversity of filesystems. EOF Disallowing read(2) on a directory has the side-effect of masking application bugs from relying on other implementation's behavior (e.g. Linux) of rejecting these with EISDIR across the board, but allowing it has been a vector for at least one stack disclosure bug in the past[0]. By POSIX, this is implementation-defined whether read() handles directories or not. Popular implementations have chosen to reject them, and this seems sensible: the data you're reading from a directory is not structured in some unified way across filesystem implementations like with readdir(2), so it is impossible for applications to portably rely on this. 
With this patch, we will reject most read(2) of a dirfd with EISDIR. Users that know what they're doing can conscientiously set security.bsd.allow_read_dir=1 to allow read(2) of directories, as it has proven useful for debugging or recovery. A future commit will further limit the sysctl to allow only the system root to read(2) directories, to make it at least relatively safe to leave on for longer periods of time. While we're adding logic pertaining to directory vnodes to vn_io_fault, an additional assertion has also been added to ensure that we're not reaching vn_io_fault with any write request on a directory vnode. Such request would be a logical error in the kernel, and must be debugged rather than allowing it to potentially silently error out. Commented out shell aliases have been placed in root's cshrc/shrc to promote awareness that grep may become noisy after this change, depending on your usage. A tentative MFC plan has been put together to try and make it as trivial as possible to identify issues and collect reports; note that this will be strongly re-evaluated. Tentatively, I will MFC this knob with the default as it is in HEAD to improve our odds of actually getting reports. The future priv(9) to further restrict the sysctl WILL NOT BE MERGED BACK, so the knob will be a faithful reversion on stable/12. We will go into the merge acknowledging that the sysctl default may be flipped back to restore historical behavior at any point if it's warranted. [0] https://www.freebsd.org/security/advisories/FreeBSD-SA-19:10.ufs.asc PR: 246412 Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous) Reviewed by: rgrimes (latest version) MFC after: 1 month (note the MFC plan mentioned above) Relnotes: absolutely, but will amend previous RELNOTES entry Differential Revision: https://reviews.freebsd.org/D24596
# e3d16bb6	24-May-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: use atomic_{store,load}_long to manage f_offset ... instead of depending on the compiler not to mess them up
# 442e617f	24-May-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: restore mtx-protected foffset locking for 32 bit platforms They depend on it to accurately read the offset. The new code is not used as it would add an interrupt enable/disable trip on top of the atomic. This also fixes a bug where 32-bit nolock request would still lock the offset. No changes for 64-bit. Reported by: emaste
# 3fc40153	23-May-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: scale foffset_lock by using atomics instead of serializing on mtx pool Contending cases still serialize on sleepq (which would be taken anyway). Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21626
# 90f29198	09-Apr-2020	Konstantin Belousov <kib@FreeBSD.org>	Remove extra call to vfs_op_exit() from vfs_write_suspend() when VFS_SYNC() fails. The vfs_write_resume() handler already does vfs_op_exit() for us. Reported by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation
# 2782c00c	22-Feb-2020	Ryan Libby <rlibby@FreeBSD.org>	vfs: quiet -Wwrite-strings Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23797
# 074ad60a	15-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: make write suspension mandatory At the time opt-in was introduced adding yourself as a writer was serializing across the mount point. Nowadays it is fully per-cpu, the only impact being a small single-threaded hit on top of what's there right now. Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which is done regardless. Should someone want to microoptimize this single-threaded they can coalesce looking the mount up with adding a write to it.
# 7b2ff0dc	13-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	Partially decompose priv_check by adding priv_check_cred_vfs_generation During buildkernel there are very frequent calls to priv_check and they all are for PRIV_VFS_GENERATION (coming from stat/fstat). This results in branching on several potential privileges checking if perhaps that's the one which has to be evaluated. Instead of the kitchen-sink approach provide a way to have commonly used privs directly evaluated.
# 2f7f11b7	08-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: tidy up vget_finish and vn_lock - remove assertion which duplicates vn_lock - use VNPASS instead of retyping the failure - report what flags were passed if panicking on them
# 3ff65f71	30-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	Remove duplicated empty lines from kern/*.c No functional changes.
# 2856d85e	08-Jan-2020	Kyle Evans <kevans@FreeBSD.org>	posix_fallocate: push vnop implementation into the fileop layer This opens the door for other descriptor types to implement posix_fallocate(2) as needed. Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D23042
# 7e2ea577	04-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: factor out avoidable branches in _vn_lock
# b249ce48	03-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427
# abd80ddb	08-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: introduce v_irflag and make v_type smaller The current vnode layout is not smp-friendly by having frequently read data avoidably sharing cachelines with very frequently modified fields. In particular v_iflag inspected for VI_DOOMED can be found in the same line with v_usecount. Instead make it available in the same cacheline as the v_op, v_data and v_type which all get read all the time. v_type is avoidably 4 bytes while the necessary data will easily fit in 1. Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new flag field with a new value: VIRF_DOOMED. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D22715
# fdc6b10d	29-Nov-2019	Konstantin Belousov <kib@FreeBSD.org>	Add a VN_OPEN_INVFS flag. vn_open_cred() assumes that it is called from the top-level of a VFS syscall. Writers must call bwillwrite() before locking any VFS resource to wait for cleanup of dirty buffers. ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which results in wait for unrelated buffers while owning ZFS vnode lock (and ZFS does not use buffer cache). VN_OPEN_INVFS allows caller to skip bwillwrite. Note that ZFS is still incorrect there, because it starts write on an mp and locks a vnode while holding another vnode lock. Reported by: Willem Jan Withagen <wjw@digiware.nl> Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 48e48578	09-Nov-2019	Rick Macklem <rmacklem@FreeBSD.org>	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The first of these semantic changes was changed to be compatible with linux5.n by r354564. For the second semantic change, the old linux man page stated that, if infd and outfd referred to the same file, EBADF should be returned. Now, the semantics is to allow infd and outfd to refer to the same file so long as the byte ranges defined by the input file offset, output file offset and len do not overlap. If the byte ranges do overlap, EINVAL should be returned. This patch modifies copy_file_range(2) to be linux5.n compatible for this semantic change.
# 15930ae1	08-Nov-2019	Rick Macklem <rmacklem@FreeBSD.org>	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The old linux man page stated that, if the offset + len exceeded file_size for the input file, EINVAL should be returned. Now, the semantics is to copy up to at most file_size bytes and return that number of bytes copied. If the offset is at or beyond file_size, a return of 0 bytes is done. This patch modifies copy_file_range(2) to be linux compatible for this semantic change. A separate patch will change copy_file_range(2) for the other semantic change, which allows the infd and outfd to refer to the same file, so long as the byte ranges do not overlap.
# 9bb37c03	16-Oct-2019	Andrew Turner <andrew@FreeBSD.org>	Stop leaking information from the kernel through timespec The timespec struct holds a seconds value in a time_t and a nanoseconds value in a long. On most architectures these are the same size, however on 32-bit architectures other than i386 time_t is 8 bytes and long is 4 bytes. Most ABIs will then pad a struct holding an 8 byte and 4 byte value to 16 bytes with 4 bytes of padding. When copying one of these structs the compiler is free to copy the padding if it wishes. In this case the padding may contain kernel data that is then leaked to userspace. Fix this by copying the timespec elements rather than the entire struct. This doesn't affect Tier-1 architectures so no SA is expected. admbugs: 651 MFC after: 1 week Sponsored by: DARPA, AFRL
# 55894117	17-Sep-2019	Konstantin Belousov <kib@FreeBSD.org>	Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY. Reviewed by: bcr (man page), emaste (previous version) PR: 240452 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21634
# 4cace859	16-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: convert struct mount counters to per-cpu There are 3 counters modified all the time in this structure - one for keeping the structure alive, one for preventing unmount and one for tracking active writers. Exact values of these counters are very rarely needed, which makes them a prime candidate for conversion to a per-cpu scheme, resulting in much better performance. Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on a 104-way 2 socket Skylake system: before: 852393 ops/s after: 76682077 ops/s Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21637
# e87f3f72	16-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: manage mnt_writeopcount with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21575
# fe7bcbaf	03-Sep-2019	Kyle Evans <kevans@FreeBSD.org>	vm pager: writemapping accounting for OBJT_SWAP Currently writemapping accounting is only done for vnode_pager which does some accounting on the underlying vnode. Extend this to allow accounting to be possible for any of the pager types. New pageops are added to update/release writecount that need to be implemented for any pager wishing to do said accounting, and we implement these methods now for both vnode_pager (unchanged) and swap_pager. The primary motivation for this is to allow other systems with OBJT_SWAP objects to check if their objects have any write mappings and reject operations with EBUSY if so. posixshm will be the first to do so in order to reject adding write seals to the shmfd if any writable mappings exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456
# 95acb40c	27-Aug-2019	Konstantin Belousov <kib@FreeBSD.org>	vn_vget_ino_gen(): relock the lower vnode on error. The function's interface assumes that the lower vnode is passed and returned locked always. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# df9bc7df	21-Aug-2019	Rick Macklem <rmacklem@FreeBSD.org>	Map ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE). Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE) on a file in a file system that does not have its own VOP_IOCTL(), the lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since ENOTTY is not listed as an error return by either the lseek(2) man page or the POSIX draft for lseek(2). This was discussed on freebsd-current@ here: http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA This trivial patch maps ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE). Reviewed by: markj Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21300
# c61b1431	15-Aug-2019	Rick Macklem <rmacklem@FreeBSD.org>	Fix copy_file_range(2) so that unneeded blocks are not allocated to the output file. When the byte range for copy_file_range(2) doesn't go to EOF on the output file and there is a hole in the input file, a hole must be "punched" in the output file. This is done by writing a block of bytes all set to 0. Without this patch, the write is done unconditionally which means that, if the output file already has a hole in that byte range, an unneeded data block of all 0 bytes would be allocated. This patch adds code to check for a hole in the output file, so that it can skip doing the write if there is already a hole in that byte range of the output file. This avoids unnecessary allocation of blocks to the output file. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D21155
# 6b1bc6f7	08-Aug-2019	Rick Macklem <rmacklem@FreeBSD.org>	Remove some harmless cruft from vn_generic_copy_file_range(). An earlier version of the patch had code that set "error" between line#s 2797-2799. When that code was moved, the second check for "error != 0" could never be true and the check became harmless cruft. This patch removes the cruft, mainly to make Coverity happy. Reported by: asomers, cem
# 61463314	08-Aug-2019	Rick Macklem <rmacklem@FreeBSD.org>	Fix copy_file_range(2) for an unlikely race during hole finding. Since the VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE) calls are done with the vnode unlocked, it is possible for another thread to do: - truncate(), lseek(), write() between the two calls and create a hole where FIOSEEKDATA returned the start of data. For this case, VOP_IOCTL(FIOSEEKHOLE) will return the same offset for the hole location. This could result in an infinite loop in the copy code, since copylen is set to 0 and the copy doesn't advance. Usually, this race is avoided because of the use of rangelocks, but the NFS server does not do range locking and could do a sequence like the above to create the hole. This patch checks for this case and makes the hole search fail, to avoid the infinite loop. At this time, it is an open question as to whether or not the NFS server should do range locking to avoid this race.
# bbbbeca3	24-Jul-2019	Rick Macklem <rmacklem@FreeBSD.org>	Add kernel support for a Linux compatible copy_file_range(2) syscall. This patch adds support to the kernel for a Linux compatible copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9). This syscall/VOP can be used by the NFSv4.2 client to implement the Copy operation against an NFSv4.2 server to do file copies locally on the server. The vn_generic_copy_file_range() function in this patch can be used by the NFSv4.2 server to implement the Copy operation. Fuse may also be able to use the VOP_COPY_FILE_RANGE() method. vn_generic_copy_file_range() attempts to maintain holes in the output file in the range to be copied, but may fail to do so if the input and output files are on different file systems with different _PC_MIN_HOLE_SIZE values. Separate commits will be done for the generated syscall files and userland changes. A commit for a compat32 syscall will be done later. Reviewed by: kib, asomers (plus comments by brooks, jilles) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20584
# 0122532e	17-Jul-2019	Alan Somers <asomers@FreeBSD.org>	F_READAHEAD: Fix r349248's overflow protection, broken by r349391 I accidentally broke the main point of r349248 when making stylistic changes in r349391. Restore the original behavior, and also fix an additional overflow that was possible when uio->uio_resid was nearly SSIZE_MAX. Reported by: cem Reviewed by: bde MFC after: 2 weeks MFC-With: 349248 Sponsored by: The FreeBSD Foundation
# 555d8f28	01-Jul-2019	Rick Macklem <rmacklem@FreeBSD.org>	Factor out the code that does a VOP_SETATTR(size) from vn_truncate(). This patch factors the code in vn_truncate() that does the actual VOP_SETATTR() of size into a separate function called vn_truncate_locked(). This will allow the NFS server and the patch that adds a copy_file_range(2) syscall to call this function instead of duplicating the code and carrying over changes, such as the recent r347151. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D20808
# 0cfc1ef3	27-Jun-2019	Alan Somers <asomers@FreeBSD.org>	FIOBMAP2: inline vn_ioc_bmap2 Reported by: kib Reviewed by: kib MFC after: 2 weeks MFC-With: 349238 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20783
# 4f53d57e	25-Jun-2019	Alan Somers <asomers@FreeBSD.org>	fcntl: style changes to r349248 Reported by: bde MFC after: 2 weeks MFC-With: 349248 Sponsored by: The FreeBSD Foundation
# 38b06f8a	20-Jun-2019	Alan Somers <asomers@FreeBSD.org>	fcntl: fix overflow when setting F_READAHEAD VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field. However, fcntl allows you to set the seqcount in bytes to any nonnegative 31-bit value. The result can be a 16-bit overflow, which will be sign-extended in functions like ffs_read. Fix this by sanitizing the argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather than INT16_MAX. Also, fifos have overloaded the f_seqcount field for a completely different purpose ever since r238936. Formalize that by using a union type. Reviewed by: cem MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20710
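The sanitizing step described in 38b06f8a can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in kern_fcntl: the function name `readahead_blocks`, the round-up rule, and the cap value `SKETCH_SEQMAX` (standing in for IO_SEQMAX) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cap standing in for IO_SEQMAX; the real value may differ. */
#define	SKETCH_SEQMAX	0x7F

/*
 * Hypothetical helper: convert a read-ahead request given in bytes into a
 * block count that is safe to store in a signed 16-bit field.
 */
static int16_t
readahead_blocks(int bytes, int blksize)
{
	int blocks;

	assert(bytes >= 0 && blksize > 0);
	/* Round up to whole blocks without computing bytes + blksize - 1,
	 * which could overflow when bytes is close to INT_MAX. */
	blocks = bytes / blksize + (bytes % blksize != 0);
	/* Clamp before narrowing so the value can never be sign-extended. */
	if (blocks > SKETCH_SEQMAX)
		blocks = SKETCH_SEQMAX;
	return ((int16_t)blocks);
}
```

Clamping to a small policy limit, rather than to INT16_MAX, mirrors the choice the commit message describes.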
# d49b446b	20-Jun-2019	Alan Somers <asomers@FreeBSD.org>	Add FIOBMAP2 ioctl This ioctl exposes VOP_BMAP information to userland. It can be used by programs like fragmentation analyzers and optimized cp implementations. But I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD and Linux. FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block number instead of 32-bit, and it also returns runp and runb. Reviewed by: mckusick MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20705
# daec9284	21-May-2019	Conrad Meyer <cem@FreeBSD.org>	Include ktr.h in more compilation units Similar to r348026, exhaustive search for uses of CTRn() and cross reference ktr.h includes. Where it was obvious that an OS compat header of some kind included ktr.h indirectly, .c files were left alone. Some of these files clearly got ktr.h via header pollution in some scenarios, or tinderbox would not be passing prior to this revision, but go ahead and explicitly include it in files using it anyway. Like r348026, these CUs did not show up in tinderbox as missing the include. Reported by: peterj (arm64/mp_machdep.c) X-MFC-With: r347984 Sponsored by: Dell EMC Isilon
# 78022527	05-May-2019	Konstantin Belousov <kib@FreeBSD.org>	Switch to use shared vnode locks for text files during image activation. kern_execve() locks text vnode exclusive to be able to set and clear VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0 condition. The change removes VV_TEXT, replacing it with the condition v_writecount <= -1, and puts v_writecount under the vnode interlock. Each text reference decrements v_writecount. To clear the text reference when the segment is unmapped, it is recorded in the vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and v_writecount is incremented on the map entry removal. The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that v_writecount does not contradict the desired change. vn_writecheck() is now racy and its use was eliminated everywhere except access. Atomic check for writeability and increment of v_writecount is performed by the VOP. vn_truncate() now increments v_writecount around VOP_SETATTR() call, lack of which is arguably a bug on its own. nullfs bypasses v_writecount to the lower vnode always, so nullfs vnode has its own v_writecount correct, and lower vnode gets all references, since object->handle is always lower vnode. On the text vnode's vm object dealloc, the v_writecount value is reset to zero, and deadfs vop_unset_text short-circuits the operation. Reclamation of lowervp always reclaims all nullfs vnodes referencing lowervp first, so no stray references are left. Reviewed by: markj, trasz Tested by: mjg, pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D19923
# ae909414	09-Apr-2019	Konstantin Belousov <kib@FreeBSD.org>	Add vn_fsync_buf(). Provide a convenience function to avoid the hack with filling fake struct vop_fsync_args and then calling vop_stdfsync(). Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 4ae3f5a7	05-Apr-2019	Konstantin Belousov <kib@FreeBSD.org>	vn_vmap_seekhole(): align running offset to the block boundary. Otherwise we might miss the last iteration where EOF appears below unaligned noff. Reported and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D19811
# 4f77f488	25-Oct-2018	Konstantin Belousov <kib@FreeBSD.org>	Implement O_BENEATH and AT_BENEATH. Flags prevent open(2) and *at(2) vfs syscalls name lookup from escaping the starting directory. Supposedly the interface is similar to the same proposed Linux flags. Reviewed by: jilles (code, previous version of manpages), 0mp (manpages) Discussed with: allanjude, emaste, jonathan Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17547
# c9e562b1	11-Sep-2018	Gordon Tetlow <gordon@FreeBSD.org>	Correct ELF header parsing code to prevent invalid ELF sections from disclosing memory. Submitted by: markj Reported by: Thomas Barabosch, Fraunhofer FKIE Approved by: re (implicit) Approved by: so Security: FreeBSD-SA-18:12.elf Security: CVE-2018-6924 Sponsored by: The FreeBSD Foundation
# b8d908b7	01-Jun-2018	Ed Maste <emaste@FreeBSD.org>	ANSIfy sys/kern
# b99aa0fb	29-May-2018	Matt Macy <mmacy@FreeBSD.org>	hwpmc: don't enter epoch section across mmap hook
# 161bf65f	24-Mar-2018	Konstantin Belousov <kib@FreeBSD.org>	In vn_io_fault1(), reduce the scope where pagefaults are disabled. Most important for the future use, do not call vm_fault_quick_hold_pages() with disabled pagefaults. Reported and tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known open source licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 03311f11	27-May-2017	Konstantin Belousov <kib@FreeBSD.org>	Use whole mnt_stat.f_fsid bits for st_dev. Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this method of reporting st_dev by stat(2). Provide a new helper vn_fsid() to avoid duplicating code to copy f_fsid to va_fsid. Note that the change is mostly cosmetic. Its motivation is to avoid sign-extension of f_fsid[0] into 64bit dev_t value which happens after dev_t becomes 64bit. Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version) Sponsored by: The FreeBSD Foundation
# 69921123	23-May-2017	Konstantin Belousov <kib@FreeBSD.org>	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). 
Essential and all-embracing testing was done by Peter Holm (pho). The heavy lifting of coordinating all these efforts and bringing the project to completion were done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439
# 3e85b721	16-May-2017	Ed Maste <emaste@FreeBSD.org>	Remove register keyword from sys/ and ANSIfy prototypes A long long time ago the register keyword told the compiler to store the corresponding variable in a CPU register, but it is not relevant for any compiler used in the FreeBSD world today. ANSIfy related prototypes while here. Reviewed by: cem, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10193
# ecc6c515	19-Feb-2017	Konstantin Belousov <kib@FreeBSD.org>	Apply noexec mount option for mmap(PROT_EXEC). Right now the noexec mount option disallows image activators to try execve the files on the mount point. Also, after r127187, noexec also limits max_prot map entries permissions for mappings of files from such mounts, but not the actual mapping permissions. As result, the API behaviour is inconsistent. The files from noexec mount can be mapped with PROT_EXEC, but if mprotect(2) drops execution permission, it cannot be re-enabled later. Make this consistent logically and aligned with behaviour of other systems, by disallowing PROT_EXEC for mmap(2). Note that this change only ensures aligned results from mmap(2) and mprotect(2), it does not prevent actual code execution from files coming from noexec mount. Such files can always be read into anonymous executable memory and executed from there. Reported by: shamaz.mazum@gmail.com PR: 217062 Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 987ff181	12-Feb-2017	Konstantin Belousov <kib@FreeBSD.org>	Consistently handle negative or wrapping offsets in the mmap(2) syscalls. For regular files and posix shared memory, POSIX requires that [offset, offset + size) range is legitimate. At the mapping time, check that offset is not negative. Allowing negative offsets might expose the data that filesystem put into vm_object for internal use, esp. due to OFF_TO_IDX() signedness treatment. Fault handler verifies that the mapped range is valid, assuming that mmap(2) checked that arithmetic gives no undefined results. For device mappings, leave the semantic of negative offsets to the driver. Correct object page index calculation to not erroneously propagate sign. In either case, disallow overflow of offset + size. Update mmap(2) man page to explain the requirement of the range validity, and behaviour when the range becomes invalid after mapping. Reported and tested by: royger (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
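The range check described in 987ff181 (offset must not be negative, and offset + size must not overflow) might be expressed as in the sketch below. The function name `map_range_ok` and the choice of `int64_t`/`uint64_t` in place of the kernel's off_t and vm_size_t are assumptions for illustration; the committed code is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validity check for a file mapping request covering
 * [offset, offset + size).
 */
static bool
map_range_ok(int64_t offset, uint64_t size)
{
	/* A negative offset is never legitimate for a regular file. */
	if (offset < 0)
		return (false);
	/* Compare against the remaining headroom instead of adding, so the
	 * check itself cannot overflow. */
	if (size > (uint64_t)INT64_MAX - (uint64_t)offset)
		return (false);
	return (true);
}
```

Testing the headroom rather than computing `offset + size` avoids relying on signed overflow, which is undefined behaviour in C.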
# e83a71c6	10-Feb-2017	Konstantin Belousov <kib@FreeBSD.org>	Fix r313495. The file type DTYPE_VNODE can be assigned as a fallback if VOP_OPEN() did not initialize file type. This is a typical code path used by normal file systems. Also, change error returned for inappropriate file type used for O_EXLOCK to EOPNOTSUPP, as declared in the open(2) man page. Reported by: cy, dhw, Iblis Lin <iblis@hs.ntnu.edu.tw> Tested by: dhw Sponsored by: The FreeBSD Foundation MFC after: 13 days
# e628e1b9	09-Feb-2017	Konstantin Belousov <kib@FreeBSD.org>	Increase a chance of devfs_close() calling d_close cdevsw method. If a file opened over a vnode has an advisory lock set at close, vn_closefile() acquires additional vnode use reference to prevent freeing the vnode in vn_close(). Side effect is that for device vnodes, devfs_close() sees that vnode reference count is greater than one and refuses to call d_close(). Create internal version of vn_close() which can avoid dropping the vnode reference if needed, and use this to execute VOP_CLOSE() without acquiring a new reference. Note that any parallel reference to the vnode would still prevent d_close call, if the reference is not from an opened file, e.g. due to stat(2). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 7903b000	09-Feb-2017	Konstantin Belousov <kib@FreeBSD.org>	Do not establish advisory locks when doing open(O_EXLOCK) or open(O_SHLOCK) for files which do not have DTYPE_VNODE type. Both flock(2) and fcntl(2) syscalls refuse to acquire advisory lock on a file which type is not DTYPE_VNODE. Do the same when lock is requested from open(2). Restructure the block in vn_open_vnode() which handles O_EXLOCK and O_SHLOCK open flags to make it easier to quit its execution earlier with an error. Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# f1f7f1cb	27-Jan-2017	Mateusz Guzik <mjg@FreeBSD.org>	hwpmc: partially depessimize mmap handling if the module is not loaded In particular this means the pmc sx lock is no longer taken when an executable mapping succeeds. MFC after: 1 week
# 25c68168	22-Jan-2017	Konstantin Belousov <kib@FreeBSD.org>	More style cleanup. Use ANSI C definition for vn_closefile(). Switch to VNASSERT in _vn_lock(), simplify messages. Sponsored by: The FreeBSD Foundation X-MFC with: r312600, r312601, r312602, r312606
# eaf0969b	21-Jan-2017	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix LK_RETRY logic braino in r312600
# 829857c8	21-Jan-2017	Mateusz Guzik <mjg@FreeBSD.org>	vfs: __predict_false the need to handle F_HASLOCK Also reorder the check with DTYPE_VNODE. Passed files are vnodes vast majority of the time, so it is typically true.
# abbc538d	21-Jan-2017	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix whitespace damage in r312600 While here wrap the previously overly long line so that it fits 80 chars.
# 1091fb52	21-Jan-2017	Mateusz Guzik <mjg@FreeBSD.org>	vfs: refactor _vn_lock Stop testing for LK_RETRY and error multiple times. Also postpone the VI_DOOMED until after LK_RETRY was seen as it reads from the vnode. No functional changes.
# 69a28758	15-Sep-2016	Ed Maste <emaste@FreeBSD.org>	Renumber license clauses in sys/kern to avoid skipping #3
# c3c0088b	20-Aug-2016	Robert Watson <rwatson@FreeBSD.org>	Audit additional vnode information in the implementation of the ftruncate(2) system call. This was not required by the Common Criteria, which needed only open-time audit. MFC after: 2 weeks Sponsored by: DARPA, AFRL
# af326ace	25-Jul-2016	Conrad Meyer <cem@FreeBSD.org>	devfs: Move most ioctl logic down to vnode layer Devfs' file layer ioctl is now just a thin shim around the vnode layer. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7286
# 971711fb	05-Jul-2016	Robert Watson <rwatson@FreeBSD.org>	Call audit hooks to capture vnode attributes for three file-descriptor method implementations: fstat(2), close(2), and poll(2). This change synchronises auditing here with similar auditing for VFS-specific system calls such as stat(2) that audit more complete vnode information. Sponsored by: DARPA, AFRL Approved by: re (kib) MFC after: 1 week
# 3f7ca894	17-May-2016	Konstantin Belousov <kib@FreeBSD.org>	Ensure that ftruncate(2) is performed synchronously when file is opened in O_SYNC mode, at least for UFS. This also handles truncation, done due to the O_SYNC | O_TRUNC flags combination to open(2), in synchronous way. Noted by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 31b67320	29-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/kern: spelling fixes. Mostly on comments but affects some debug messages. MFC after: 2 weeks
# 74b8d63d	10-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	Cleanup unnecessary semicolons from the kernel. Found with devel/coccinelle.
# 6adf1948	22-Jan-2016	Konstantin Belousov <kib@FreeBSD.org>	The struct file f_advice member is overlaid with the devfs f_cdevpriv data. If vnode bypass for devfs file failed, vn_read/vn_write are called and might try to dereference f_advice. Limit the accesses to f_advice to VREG vnodes only, which is the type ensured by posix_fadvise(). The f_advice for regular files is protected by mtxpool lock. Recheck that f_advice is not NULL after lock is taken. Reported and tested by: bde Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
# ce958bde	17-Jan-2016	Konstantin Belousov <kib@FreeBSD.org>	When cleaning up from failed adv locking and checking for write, do not call VOP_CLOSE() manually. Instead, delegate the close to fo_close() performed as part of the fdrop() on the file failed to open. For this, finish constructing file on error, in particular, set f_vnode and f_ops. Forcibly resetting f_ops to badfileops disabled additional cleanups performed by fo_close() for some file types, in this case it was noted that cdevpriv data was corrupted. Since fo_close() call must be enabled for some file types, it makes more sense to enable it for all files opened through vn_open_cred(). In collaboration with: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 78e79434	08-Oct-2015	Fabien Thomas <fabient@FreeBSD.org>	Fix r283998 that broke mapin events for hwpmc. Reviewed by: jhb Sponsored by: Stormshield
# 3138cd36	30-Sep-2015	Mark Johnston <markj@FreeBSD.org>	As a step towards the elimination of PG_CACHED pages, rework the handling of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726
# 9e18c9eb	09-Sep-2015	Konstantin Belousov <kib@FreeBSD.org>	For open("name", O_DIRECTORY | O_CREAT), do not try to create the named node, open(2) cannot create directories. But do allow the flag combination to succeed if the directory already exists. Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always invalid for the same reason, since open(2) cannot create directory. Note that there is an argument that O_DIRECTORY | O_CREAT should be invalid always, regardless of the target directory existence or O_EXCL. The current fix is conservative and allows the call to succeed in the situation where it succeeded before the patch. Reported by: Tom Ridge <freebsd@tom-ridge.com> Reviewed by: rwatson PR: 202892 Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 14bdbaf2	03-Sep-2015	Conrad Meyer <cem@FreeBSD.org>	Detect badly behaved coredump note helpers Coredump notes depend on being able to invoke dump routines twice; once in a dry-run mode to get the size of the note, and another to actually emit the note to the corefile. When a note helper emits a different length section the second time around than the length it requested the first time, the kernel produces a corrupt coredump. NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' fd table via vn_fullpath. As vnodes may move around during dump, this is racy. So: - Detect badly behaved notes in putnote() and pad underfilled notes. - Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to exercise the NT_PROCSTAT_FILES corruption. It simply picks random lengths to expand or truncate paths to in fo_fill_kinfo_vnode(). - Add a sysctl, kern.coredump_pack_fileinfo, to allow users to disable kinfo packing for PROCSTAT_FILES notes. This should avoid both FILES note corruption and truncation, even if filenames change, at the cost of about 1 kiB in padding bloat per open fd. Document the new sysctl in core.5. - Fix note_procstat_files to self-limit in the 2nd pass. Since sometimes this will result in a short write, pad up to our advertised size. This addresses note corruption, at the risk of sometimes truncating the last several fd info entries. - Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the zero padding. With suggestions from: bjk, jhb, kib, wblock Approved by: markj (mentor) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3548
# 89177288	30-Jul-2015	Konstantin Belousov <kib@FreeBSD.org>	vn_io_fault() handling of the LOR for i/o into the file-backed buffers has observable overhead when the buffer pages are not resident or not mapped. The overhead comes from at least two factors. One is the additional work needed to detect the situation, prepare and execute the rollbacks. Another is the consequence of the i/o splitting into the batches of the held pages, causing filesystems to see a series of smaller i/o requests instead of the single large request. Note that expected case of the resident i/o buffer does not expose these issues. Provide a prefaulting for the userspace i/o buffers, disabled by default. I am careful of not enabling prefaulting by default for now, since it would be detrimental for the applications which speculatively pass extra-large buffers of anonymous memory to not deal with buffer sizing (if such apps exist). Found and tested by: bde, emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 5f34e93c	05-Jul-2015	Mark Johnston <markj@FreeBSD.org>	Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT. This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough filesystems like nullfs and unionfs no longer need to inherit this information from their lower layer(s). This change also restores the pre-r273336 behaviour of using the presence of a susp_clean VFS method to request suspension support. Reviewed by: kib, mjg Differential Revision: https://reviews.freebsd.org/D2937
# f6f6d240	10-Jun-2015	Mateusz Guzik <mjg@FreeBSD.org>	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thread's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.
# 7077c426	04-Jun-2015	John Baldwin <jhb@FreeBSD.org>	Add a new file operations hook for mmap operations. File type-specific logic is now placed in the mmap hook implementation rather than requiring it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to support mmap() as well as potentially allowing mmap() for existing file types that do not currently support any mapping. The vm_mmap() function is now split up into two functions. A new vm_mmap_object() function handles the "back half" of vm_mmap() and accepts a referenced VM object to map rather than a (handle, handle_type) tuple. vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a VM object and then calling vm_mmap_object() to handle the actual mapping. The vm_mmap() function remains for use by other parts of the kernel (e.g. device drivers and exec) but now only supports mapping vnodes, character devices, and anonymous memory. The mmap() system call invokes vm_mmap_object() directly with a NULL object for anonymous mappings. For mappings using a file descriptor, the descriptor's fo_mmap() hook is invoked instead. The fo_mmap() hook is responsible for performing type-specific checks and adjustments to arguments as well as possibly modifying mapping parameters such as flags or the object offset. The fo_mmap() hook routines then call vm_mmap_object() to handle the actual mapping. The fo_mmap() hook is optional. If it is not set, then fo_mmap() will fail with ENODEV. A fo_mmap() hook is implemented for regular files, character devices, and shared memory objects (created via shm_open()). While here, consistently use the VM_PROT_* constants for the vm_prot_t type for the 'prot' variable passed to vm_mmap() and vm_mmap_object() as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines. Previously some places were using the mmap()-specific PROT_* constants instead. While this happens to work because PROT_xx == VM_PROT_xx, using VM_PROT_* is more correct. 
Differential Revision: https://reviews.freebsd.org/D2658 Reviewed by: alc (glanced over), kib MFC after: 1 month Sponsored by: Chelsio
# 2db0e1f5	27-May-2015	Konstantin Belousov <kib@FreeBSD.org>	Add V_MNTREF flag to the vn_start_write(9) and vn_start_secondary_write(9) functions. The flag indicates that the caller already owns a reference on the mount point, and the functions can consume it. The reference is released by vn_finished_write(9) and vn_finished_secondary_write(9) in due course. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# d5fec489	21-Apr-2015	Craig Rodrigues <rodrigc@FreeBSD.org>	Support file verification in MAC. * Add VCREAT flag to indicate when a new file is being created * Add VVERIFY to indicate verification is required * Both VCREAT and VVERIFY are only passed on the MAC method vnode_check_open and are removed from the accmode after * Add O_VERIFY flag to rtld open of objects * Add 'v' flag to __sflags to set O_VERIFY flag. Submitted by: Steve Kiernan <stevek@juniper.net> Obtained from: Juniper Networks, Inc. GitHub Pull Request: https://github.com/freebsd/freebsd/pull/27 Relnotes: yes
# 8ee9765a	21-Dec-2014	Konstantin Belousov <kib@FreeBSD.org>	Add VN_OPEN_NAMECACHE flag for vn_open_cred(9), which requests that the created file name was cached. Use the flag for core dumps. Requested by: rpaulo Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 6c21f6ed	18-Dec-2014	Konstantin Belousov <kib@FreeBSD.org>	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache trashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitly knows about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 0061ddb3	13-Dec-2014	Konstantin Belousov <kib@FreeBSD.org>	Only sleep interruptible while waiting for suspension end when filesystem specified VFCF_SBDRY flag, i.e. for NFS. There are two issues with the sleeps. First, applications may get unexpected EINTR from the disk i/o syscalls. Second, interruptible sleep allows the stop of the process, and since mount point is referenced while thread sleeps, unmount cannot free mount point structure's memory, blocking unmount indefinitely. Even for NFS, it is probably only reasonable to enable PCATCH for intr mounts, but this information is currently not available at VFS level. Reported and tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 5fab60a0	14-Nov-2014	Konstantin Belousov <kib@FreeBSD.org>	In vfs_write_suspend_umnt(), if suspension cannot be established, do not forget to restore write ops count when returning the error. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 4fce16e4	20-Oct-2014	Mateusz Guzik <mjg@FreeBSD.org>	Provide vfs suspension support only for filesystems which need it, take two. nullfs and unionfs need to request suspension if underlying filesystem(s) use it. Utilize mnt_kern_flag for this purpose. This is a fixup for 273271. No strong objections from: kib Pointy hat to: mjg MFC after: 2 weeks
# 020b8f17	19-Oct-2014	Mateusz Guzik <mjg@FreeBSD.org>	Provide vfs suspension support only for filesystems which need it. Need is expressed by providing vfs_susp_clean function in vfsops. Differential Revision: D952 Reviewed by: kib (previous version) MFC after: 2 weeks
# 4142462e	04-Oct-2014	Konstantin Belousov <kib@FreeBSD.org>	Slightly reword comment. Move code, which is described by the comment, after it. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week
# e3d6fece	04-Oct-2014	Konstantin Belousov <kib@FreeBSD.org>	Add IO_RANGELOCKED flag for vn_rdwr(9), which specifies that vnode is not locked, but range is. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 9696feeb	22-Sep-2014	John Baldwin <jhb@FreeBSD.org>	Add a new fo_fill_kinfo fileops method to add type-specific information to struct kinfo_file. - Move the various fill_*_info() methods out of kern_descrip.c and into the various file type implementations. - Rework the support for kinfo_ofile to generate a suitable kinfo_file object for each file and then convert that to a kinfo_ofile structure rather than keeping a second, different set of code that directly manipulates type-specific file information. - Remove the shm_path() and ksem_info() layering violations. Differential Revision: https://reviews.freebsd.org/D775 Reviewed by: kib, glebius (earlier version)
# 037755fd	26-Aug-2014	Mateusz Guzik <mjg@FreeBSD.org>	Fix up races with f_seqcount handling. It was possible that the kernel would overwrite user-supplied hint. Abuse vnode lock for this purpose. In collaboration with: kib MFC after: 1 week
# 895b3782	14-Jul-2014	Konstantin Belousov <kib@FreeBSD.org>	Extract the code to put a filesystem into the suspended state (at the unmount time) in the helper vfs_write_suspend_umnt(). Use it instead of two inline copies in FFS. Fix the bug in the FFS unmount, when suspension failed, the ufs extattrs were not reinitialized. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# a6945216	14-Jul-2014	Konstantin Belousov <kib@FreeBSD.org>	Generalize vn_get_ino() to allow filesystems to use custom vnode producer, instead of hard-coding VFS_VGET(). New function, which takes callback, is called vn_get_ino_gen(), standard callback for vn_get_ino() is provided. Convert inline copies of vn_get_ino() in msdosfs and cd9660 into the uses of vn_get_ino_gen(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 7b81a399	17-Jun-2014	Konstantin Belousov <kib@FreeBSD.org>	In msdosfs_setattr(), add a check for result of the utimes(2) permissions test, forgotten in r164033. Refactor the permission checks for utimes(2) into vnode helper function vn_utimes_perm(9), and simplify its code comparing with the UFS origin, by writing the call to VOP_ACCESSX only once. Use the helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5). Reported by: bde Reviewed by: bde, trasz Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 2e501b0a	14-Jun-2014	Konstantin Belousov <kib@FreeBSD.org>	Use vn_io_fault for the writes from core dumping code. Recursing into VM due to copyin(9) faulting while VFS locks are held is deadlock-prone there in the same way as for the write(2) syscall. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 6f2b769c	15-Mar-2014	John-Mark Gurney <jmg@FreeBSD.org>	change td_retval into a union w/ off_t, with defines to mask the change... This eliminates a cast, and also forces td_retval (often 2 32-bit registers) to be aligned so that off_t's can be stored there on arches with strict alignment requirements like armeb (AVILA)... On i386, this doesn't change alignment, and on amd64 it doesn't either, as register_t is already 64bits... This will also prevent future breakage due to people adding additional fields to the struct... This gets AVILA booting a bit farther... Reviewed by: bde
# 65f05eeb	17-Dec-2013	Konstantin Belousov <kib@FreeBSD.org>	If vn_open_vnode() succeeded in opening the vnode, but subsequent advisory lock cannot be obtained, prevent double-close of the vnode in vn_close() called from the fdrop(), by resetting the file's f_ops methods. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 7e14088d	20-Nov-2013	Konstantin Belousov <kib@FreeBSD.org>	Revert back to use int for the page counts. In vn_io_fault(), the i/o is chunked to pieces limited by integer io_hold_cnt tunable, while vm_fault_quick_hold_pages() takes integer max_count as the upper bound. Rearrange the checks to correctly handle overflowing address arithmetic. Submitted by: bde Tested by: pho Discussed with: alc MFC after: 1 week
# d005ed53	12-Nov-2013	Konstantin Belousov <kib@FreeBSD.org>	Avoid overflow for the page counts. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 6272798a	09-Nov-2013	Konstantin Belousov <kib@FreeBSD.org>	Both vn_close() and VFS_PROLOGUE() evaluate vp->v_mount twice, without holding the vnode lock; vp->v_mount is checked first for NULL equality, and then dereferenced if not NULL. If vnode is reclaimed meantime, second dereference would still give NULL. Change VFS_PROLOGUE() to evaluate the mp once, convert MNTK_SHARED_WRITES and MNTK_EXTENDED_SHARED tests into inline functions. Reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 9bec6325	13-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	When opening or closing fifo, ensure that the vnode is locked exclusively. Filesystems are assumed to disable shared locking for the fifo vnode locks, but some do not. Reported and tested by: olgeni Discussed with: avg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)
# c0a46535	21-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Make the seek a method of the struct fileops. Tested by: pho Sponsored by: The FreeBSD Foundation
# b1dd38f4	16-Aug-2013	Konstantin Belousov <kib@FreeBSD.org>	Restore the previous sendfile(2) behaviour on the block devices. Provide valid .fo_sendfile method for several missed struct fileops. Reviewed by: glebius Sponsored by: The FreeBSD Foundation
# ca04d21d	15-Aug-2013	Gleb Smirnoff <glebius@FreeBSD.org>	Make sendfile() a method in the struct fileops. Currently only vnode backed file descriptors have this method implemented. Reviewed by: kib Sponsored by: Nginx, Inc. Sponsored by: Netflix
# cc3d8c35	09-Jul-2013	Konstantin Belousov <kib@FreeBSD.org>	There are several code sequences like vfs_busy(mp); vfs_write_suspend(mp); which are problematic if other thread starts unmount between two calls. The unmount starts a write, while vfs_write_suspend() drain writers. On the other hand, unmount drains busy references, causing the deadlock. Add a flag argument to vfs_write_suspend and require the callers of it to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the mount path, i.e. the covered vnode is not locked. The suspension is not attempted if VS_SKIP_UNMOUNT is specified and unmount is in progress. Reported and tested by: Andreas Longwitz <longwitz@incore.de> Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
# 3d4c503c	31-May-2013	John Baldwin <jhb@FreeBSD.org>	Style fixes to vn_ioctl(). Suggested by: bde
# dfa66c01	03-May-2013	John Baldwin <jhb@FreeBSD.org>	Fix FIONREAD on regular files. The computed result was being ignored and it was being passed down to VOP_IOCTL() where it promptly resulted in ENOTTY due to a missing else for the past 8 years. While here, use a shared vnode lock while fetching the current file's size. MFC after: 1 week
# 926cd204	30-Mar-2013	Matthew D Fleming <mdf@FreeBSD.org>	Use a shared lock for VOP_GETEXTATTR, as it is a read-like operation. MFC after: 1 week
# aed5a114	15-Mar-2013	Konstantin Belousov <kib@FreeBSD.org>	Separate the copyright lines and the informational block by a blank line. Requested by: joel MFC after: 2 weeks
# 5791cee8	14-Mar-2013	Konstantin Belousov <kib@FreeBSD.org>	Add my copyright for the 2012 year work, in particular vn_io_fault() and f_offset locking. Add required Foundation notice for r248319. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 5f5f0554	15-Mar-2013	Konstantin Belousov <kib@FreeBSD.org>	Implement the helper function vn_io_fault_pgmove(), intended to use by the filesystem VOP_READ() and VOP_WRITE() implementations in the same way as vn_io_fault_uiomove() over the unmapped buffers. Helper provides the convenient wrapper over the pmap_copy_pages() for struct uio consumers, taking care of the TDP_UIOHELD situations. Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 2 weeks
# 89f6b863	08-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follows: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
# 71ac38e8	01-Mar-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Remove unnecessary variables.
# d7ffa248	15-Feb-2013	Sergey Kandaurov <pluknet@FreeBSD.org>	vn_io_faults_cnt: - use u_long consistently - use SYSCTL_ULONG to match the type of variable Reviewed by: kib MFC after: 1 week
# 9e2677fd	31-Jan-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Simplify code a bit. This is leftover after Giant removal from VFS.
# ddd6b3fc	10-Jan-2013	Konstantin Belousov <kib@FreeBSD.org>	Add flags argument to vfs_write_resume() and remove vfs_write_resume_flags(). Sponsored by: The FreeBSD Foundation
# f99cb34c	01-Jan-2013	Konstantin Belousov <kib@FreeBSD.org>	The process_deferred_inactive() function locks the vnodes of the ufs mount, which means that is must not be called while the snaplock is owned. The vfs_write_resume(9) does call the function as the VFS_SUSP_CLEAN() method, which is too early and falls into the region still protected by snaplock. Add yet another flag for the vfs_write_resume_flags() to avoid calling suspension cleanup handler after the suspend is lifted, and use it in the ffs_snapshot() call to vfs_write_resume. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 91e94745	28-Dec-2012	Konstantin Belousov <kib@FreeBSD.org>	Make it possible to atomically resume writes on the mount and account the write start, by adding a variation of the vfs_write_resume(9) which accepts flags. Use the new function to prevent a deadlock between parallel suspension and snapshotting a UFS mount. The ffs_snapshot() code performed vfs_write_resume() followed by vn_start_write() while owning the snaplock. If the suspension intervenes between resume and vn_start_write(), the deadlock occurred after the suspending thread tried to lock the snaplock, most typically during the write in the ffs_copyonwrite(). Reported and tested by: Andreas Longwitz <longwitz@incore.de> Reviewed by: mckusick MFC after: 2 weeks X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC, in HEAD
# f121e3e8	27-Nov-2012	Pawel Jakub Dawidek <pjd@FreeBSD.org>	- Add NOCAPCHECK flag to namei that allows lookup to work even if the process is in capability mode. - Add VN_OPEN_NOCAPCHECK flag for vn_open_cred(), which will be converted into NOCAPCHECK namei flag. This functionality will be used to enable core dumps for sandboxed processes. Reviewed by: rwatson Obtained from: WHEEL Systems MFC after: 2 weeks
# 140dedb8	02-Nov-2012	Konstantin Belousov <kib@FreeBSD.org>	The r241025 fixed the case when a binary, executed from nullfs mount, was still possible to open for write from the lower filesystem. There is a symmetric situation where the binary could already have file descriptors opened for write, but it can be executed from the nullfs overlay. Handle the issue by passing one v_writecount reference to the lower vnode if nullfs vnode has non-zero v_writecount. Note that only one write reference can be donated, since nullfs only keeps one use reference on the lower vnode. Always use the lower vnode v_writecount for the checks. Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT to manipulate the v_writecount value, which manages a single bypass reference to the lower vnode. Calling the VOPs instead of directly accessing v_writecount provides the fix described in the previous paragraph. Tested by: pho MFC after: 3 weeks
# 5050aa86	22-Oct-2012	Konstantin Belousov <kib@FreeBSD.org>	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
# 877d24ac	28-Sep-2012	Konstantin Belousov <kib@FreeBSD.org>	Fix the mis-handling of the VV_TEXT on the nullfs vnodes. If you have a binary on a filesystem which is also mounted over by nullfs, you could execute the binary from the lower filesystem, or from the nullfs mount. When executed from lower filesystem, the lower vnode gets VV_TEXT flag set, and the file cannot be modified while the binary is active. But, if executed as the nullfs alias, only the nullfs vnode gets VV_TEXT set, and you still can open the lower vnode for write. Add a set of VOPs for the VV_TEXT query, set and clear operations, which are correctly bypassed to lower vnode. Tested by: pho (previous version) MFC after: 2 weeks
# 8c706ce0	25-Sep-2012	Pawel Jakub Dawidek <pjd@FreeBSD.org>	vn_write() always expects FOF_OFFSET flag, which is asserted at the beginning, so there is no need to check for it. Sponsored by: FreeBSD Foundation MFC after: 2 weeks
# e838f09c	31-Jul-2012	John Baldwin <jhb@FreeBSD.org>	Reorder the management of advisory locks on open files so that the advisory lock is obtained before the write count is increased during open() and the lock is released after the write count is decreased during close(). The first change closes a race where an open() that will block with O_SHLOCK or O_EXLOCK can increase the write count while it waits. If the process holding the current lock on the file then tries to call exec() on the file it has locked, it can fail with ETXTBUSY even though the advisory lock is preventing other threads from successfully completing a writable open(). The second change closes a race where a read-only open() with O_SHLOCK or O_EXLOCK may return successfully while the write count is non-zero due to another descriptor that had the advisory lock and was blocking the open() still being in the process of closing. If the process that completed the open() then attempts to call exec() on the file it locked, it can fail with ETXTBUSY even though the other process that held a write lock has closed the file and released the lock. Reviewed by: kib MFC after: 1 month
# c5c1199c	02-Jul-2012	Konstantin Belousov <kib@FreeBSD.org>	Extend the KPI to lock and unlock f_offset member of struct file. It now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is 64bit quantity, always read and written under the mtxpool protection. This fixes apparently easy to trigger race when parallel lseek()s or lseek() and read/write could destroy file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks
# 854c3ce7	21-Jun-2012	Konstantin Belousov <kib@FreeBSD.org>	Fix locking for f_offset, vn_read() and vn_write() cases only, for now. It seems that intended locking protocol for struct file f_offset field was as follows: f_offset should always be changed under the vnode lock (except fcntl(2) and lseek(2) did not follow the rules). Since read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally taken to serialize shared vnode lock owners. This was broken first by enabling shared lock on writes, then by fadvise changes, which moved f_offset assignment from under vnode lock, and last by vn_io_fault() doing chunked i/o. More, due to uio_offset not yet valid in vn_io_fault(), the range lock for reads was taken on the wrong region. Change the locking for f_offset to always use FOFFSET_LOCKED block, which is placed before rangelocks in the lock order. Extract foffset_lock() and foffset_unlock() functions which implement FOFFSET_LOCKED lock, and consistently lock f_offset with it in the vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is not set for the vnode mount. Indicate that f_offset is already valid for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET flag, and assert that all callers of vn_read() and vn_write() follow this protocol. Extract get_advice() function to calculate the POSIX_FADV_XXX value for the i/o region, and use it where appropriate. Reviewed by: jhb Tested by: pho MFC after: 2 weeks
# cd4ecf3c	19-Jun-2012	John Baldwin <jhb@FreeBSD.org>	Further refine the implementation of POSIX_FADV_NOREUSE. First, extend the changes in r230782 to better handle the common case of using NOREUSE with sequential reads. A NOREUSE file descriptor will now track the last implicit DONTNEED request it made as a result of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the previous range, it will apply the DONTNEED request to the entire range of both the previous read and the current read. The effect is that each read of a file accessed sequentially will apply the DONTNEED request to the entire range that has been read. This allows NOREUSE to properly handle misaligned reads by flushing each buffer to cache once it has been completely read. Second, apply the same changes made to read(2) by r230782 and this change to writes. This provides much better performance in the sequential write case as it allows writes to still be clustered. It also provides much better performance for misaligned writes. It does mean that NOREUSE will be generally ineffective for non-sequential writes as the current implementation relies on a future NOREUSE write's implicit DONTNEED request to flush the dirty buffer from the current write. MFC after: 2 weeks
# 7ac1b61a	08-Jun-2012	John Baldwin <jhb@FreeBSD.org>	Split the second half of vn_open_cred() (after a vnode has been found via a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function and use this function in fhopen() instead of duplicating code from vn_open_cred() directly. Tested by: pho Reviewed by: kib MFC after: 2 weeks
# bba08085	03-Jun-2012	Konstantin Belousov <kib@FreeBSD.org>	Add a knob to disable vn_io_fault. MFC after: 1 month
# bb2f52a6	03-Jun-2012	Konstantin Belousov <kib@FreeBSD.org>	Count and export the number of prefaulting happen. MFC after: 1 month
# 41014d99	30-May-2012	Konstantin Belousov <kib@FreeBSD.org>	vn_io_fault() is a facility to prevent page faults while filesystems perform copyin/copyout of the file data into the usermode buffer. A typical filesystem holds the vnode lock and some buffer locks over the VOP_READ() and VOP_WRITE() operations, and since page fault handler may need to recurse into VFS to get the page content, a deadlock is possible. The facility works by disabling page faults handling for the current thread and attempting to execute i/o while allowing uiomove() to access the usermode mapping of the i/o buffer. If all buffer pages are resident, uiomove() is successful and request is finished. If EFAULT is returned from uiomove(), the pages backing i/o buffer are faulted in and held, and the copyin/out is performed using uiomove_fromphys() over the held pages for the second attempt of VOP call. Since pages are held in chunks to prevent large i/o requests from starving free pages pool, and since vnode lock is only taken for i/o over the current chunk, the vnode lock no longer protects atomicity of the whole i/o request. Use newly added rangelocks to provide the required atomicity of i/o regarding other i/o and truncations. Filesystems need to explicitly opt-in into the scheme, by setting the MNTK_NO_IOPF struct mount flag, and optionally by using vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or converting uio into request for uiomove_fromphys(). Reviewed by: bf (comments), mdf, pjd (previous version) Tested by: pho Tested by: flo, Gustau Pérez <gperez entel upc edu> (previous version) MFC after: 2 months
# 292520f7	25-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Add a vn_bmap_seekhole(9) vnode helper which can be used by any filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA commands for lseek(2). MFC after: 2 weeks
# b47f6241	08-Mar-2012	John Baldwin <jhb@FreeBSD.org>	Add KTR_VFS traces to track modifications to a vnode's writecount.
# 526d0bd5	20-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month
# 2bd3e4c2	30-Jan-2012	John Baldwin <jhb@FreeBSD.org>	Refine the implementation of POSIX_FADV_NOREUSE for the read(2) case such that instead of using direct I/O it allows read-ahead similar to POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the read(2) has completed to purge just-read data. The write(2) path continues to use direct I/O for POSIX_FADV_NOREUSE for now. Note that NOREUSE works optimally if an application reads and writes full fs blocks.
# 936c09ac	03-Nov-2011	John Baldwin <jhb@FreeBSD.org>	Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month
# 8451d0dd	16-Sep-2011	Kip Macy <kmacy@FreeBSD.org>	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)
# 82378711	25-Aug-2011	Martin Matuska <mm@FreeBSD.org>	Generalize ffs_pages_remove() into vn_pages_remove(). Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F. PR: kern/160035, kern/156933 Reviewed by: kib, pjd Approved by: re (kib) MFC after: 1 week
# 9c00bb91	16-Aug-2011	Konstantin Belousov <kib@FreeBSD.org>	Add the fo_chown and fo_chmod methods to struct fileops and use them to implement fchown(2) and fchmod(2) support for several file types that previously lacked it. Add MAC entries for chown/chmod done on posix shared memory and (old) in-kernel posix semaphores. Based on the submission by: glebius Reviewed by: rwatson Approved by: re (bz)
# 3d08a76b	12-May-2011	Matthew D Fleming <mdf@FreeBSD.org>	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
# e7ceb1e9	07-Feb-2011	Matthew D Fleming <mdf@FreeBSD.org>	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yield() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 3297cdd0	26-Jun-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Correct arguments order.
# 77dda2b9	06-May-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Avoid overflow. Submitted by: bde@
# 307d88b7	06-May-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Style fixes and removal of unneeded variable. Submitted by: bde@
# b5f770bd	05-May-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize(). Reviewed by: kib
# 0866329b	17-Apr-2010	Andriy Gapon <avg@FreeBSD.org>	MFC r206129: vn_stat: use va_blocksize when setting st_blksize
# 364b8a7b	03-Apr-2010	Andriy Gapon <avg@FreeBSD.org>	vn_stat: take into account va_blocksize when setting st_blksize As currently st_blksize is always PAGE_SIZE, it is playing safe to not use any smaller value. For some cases this might not be optimal, but at least nothing should get broken. Generally I don't expect this commit to change much for the following reasons (in case of VREG, VDIR): - application I/O and physical I/O are sufficiently decoupled by filesystem code, buffer cache code, cluster and read-ahead logic - not all applications use st_blksize as a hint, some use f_iosize, some use fixed block sizes I expect writes to the middle of files on ZFS to benefit the most from this change. Silence from: fs@ MFC after: 2 weeks
# 510ea843	28-Mar-2010	Ed Schouten <ed@FreeBSD.org>	Rename st_timespec fields to st_tim for POSIX 2008 compliance. A nice thing about POSIX 2008 is that it finally standardizes a way to obtain file access/modification/change times in sub-second precision, namely using struct timespec, which we already have for a very long time. Unfortunately POSIX uses different names. This commit adds compatibility macros, so existing code should still build properly. Also change all source code in the kernel to work without any of the compatibility macros. This makes it all less ambiguous. I am also renaming st_birthtime to st_birthtim, even though it was a local extension anyway. It seems Cygwin also has a st_birthtim.
# bf876fcd	27-Mar-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	MFC r200273: Don't add VAPPEND if the file is not being opened for writing. Note that this only affects cases where open(2) is being used improperly - i.e. when the user specifies O_APPEND without O_WRONLY or O_RDWR. Reviewed by: rwatson
# 0fef797f	21-Mar-2010	Ed Schouten <ed@FreeBSD.org>	Actually make O_DIRECTORY work. According to POSIX open() must return ENOTDIR when the path name does not refer to a directory. Change vn_open() to respect this flag. This also simplifies the Linuxolator a bit.
# 9d7031a6	08-Dec-2009	Edward Tomasz Napierala <trasz@FreeBSD.org>	Don't add VAPPEND if the file is not being opened for writing. Note that this only affects cases where open(2) is being used improperly - i.e. when the user specifies O_APPEND without O_WRONLY or O_RDWR. Reviewed by: rwatson
# 931d1367	07-Dec-2009	Xin LI <delphij@FreeBSD.org>	MFC revision 197579 and 199617: Add two new fcntls to enable/disable read-ahead: - F_READAHEAD: specify the amount for sequential access. The amount is specified in bytes and is rounded up to nearest block size. - F_RDAHEAD: Darwin compatible version that uses 128KB as the sequential access size. A third argument of zero disables the read-ahead behavior. Please note that the read-ahead amount is also constrained by sysctl variable, vfs.read_max, which may need to be raised in order to better utilize this feature. Thanks Igor Sysoev for proposing the feature and submitting the original version, and kib@ for his valuable comments.
# 3b1b0980	04-Nov-2009	Edward Tomasz Napierala <trasz@FreeBSD.org>	Revert r198874, pending further discussion.
# 8fafa5ce	03-Nov-2009	Edward Tomasz Napierala <trasz@FreeBSD.org>	Make sure we don't end up with VAPPEND without VWRITE, if someone calls open(2) like this: open(..., O_APPEND).
# 82aebf69	28-Sep-2009	Xin LI <delphij@FreeBSD.org>	Add two new fcntls to enable/disable read-ahead: - F_READAHEAD: specify the amount for sequential access. The amount is specified in bytes and is rounded up to the nearest block size. - F_RDAHEAD: Darwin compatible version that uses 128KB as the sequential access size. A third argument of zero disables the read-ahead behavior. Please note that the read-ahead amount is also constrained by the sysctl variable vfs.read_max, which may need to be raised in order to better utilize this feature. Thanks Igor Sysoev for proposing the feature and submitting the original version, and kib@ for his valuable comments. Submitted by: Igor Sysoev <is rambler-co ru> Reviewed by: kib@ MFC after: 1 month
# c02280f5	08-Sep-2009	Konstantin Belousov <kib@FreeBSD.org>	MFC r196692: Make the mnt_writeopcount and mnt_secondary_writes counters, used by the suspension code, not greater than mnt_ref reference counter value. MFC r196733: Fix mount reference leak when V_XSLEEP is specified to vn_start_write(). Approved by: re (kensmith)
# 579b9760	31-Aug-2009	Konstantin Belousov <kib@FreeBSD.org>	Fix mount reference leak when V_XSLEEP is specified to vn_start_write(). Submitted by: tegge
# a505c2c7	31-Aug-2009	Konstantin Belousov <kib@FreeBSD.org>	Make the mnt_writeopcount and mnt_secondary_writes counters, used by the suspension code, not greater than mnt_ref reference counter value. Increment mnt_ref together with write counter in vn_start_write()/ vn_start_secondary_write(), releasing in vn_finished_write/vn_finished_secondary_write(). Since r186197, unmount code requires that no writers occurred after all references are expired. We still could get write counter incremented for freed or reused struct mount, but it seems to be innocent, since corresponding vnode should be referenced and reclaimed then. Reported by: pho (last half a year), erwin Reviewed by: attilio Tested by: pho, erwin MFC after: 1 week
# f1eccd05	02-Jul-2009	Konstantin Belousov <kib@FreeBSD.org>	In vn_vget_ino() and its inline equivalents, mnt_ref() the mount point around the sequence that drops the vnode lock and then busies the mount point. Not having a locked vnode or a direct reference to the mp allows the forced unmount to proceed, making mp unmounted or reused. Tested by: pho Reviewed by: jeff Approved by: re (kensmith) MFC after: 2 weeks
# e0c161b8	21-Jun-2009	Konstantin Belousov <kib@FreeBSD.org>	Add another flags argument to vn_open_cred. Use it to specify that some vn_open_cred invocations shall not audit namei path. In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by default implementation of vop_vptocnp, and for the open done for core file. vn_fullpath is called from the audit code, and vn_open there needs to disable audit to avoid infinite recursion. Core file is created on return to user mode, which, in particular, happens during syscall return. The creation of the core file is audited by direct calls, and we do not want to overwrite audit information for syscall. Reported, reviewed and tested by: rwatson
# 27bfb741	08-Jun-2009	Paul Saab <ps@FreeBSD.org>	Simplify shared vnode locking and extend it to also include fsync. Also, in vop_write, no longer assert for exclusive locks on the vnode. Reviewed by: jhb, kmacy, jeffr
# bcf11e8d	05-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# 983b43c5	04-Jun-2009	Paul Saab <ps@FreeBSD.org>	When checking for shared writes, use the struct mount returned from vn_start_write. Reviewed by: jhb
# a6d545d8	04-Jun-2009	Paul Saab <ps@FreeBSD.org>	Support shared vnode locks for write operations when the offset is provided on filesystems that support it. This really improves mysql + innodb performance on ZFS. Reviewed by: jhb, kmacy, jeffr
# dfd233ed	11-May-2009	Attilio Rao <attilio@FreeBSD.org>	Remove the thread argument from the FSD (File-System Dependent) parts of the VFS. Now all the VFS_* functions and related parts don't want the context as long as it always refers to curthread. In some points, in particular when dealing with VOPs and functions living in the same namespace (eg. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP. While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option. VFS KPI is heavily changed by this commit, so third-party modules need to be recompiled. Bump __FreeBSD_version in order to signal such situation.
# 41b72e6e	07-May-2009	Konstantin Belousov <kib@FreeBSD.org>	Eliminate the loop and the call to pause(9) in vn_vget_ino(). If vfs_busy(MBF_NOWAIT) failed, unlock the vnode and sleep in vfs_busy(). Suggested and reviewed by: jeff Tested by: pho MFC after: 3 weeks
# 5e6a9266	13-Apr-2009	Kip Macy <kmacy@FreeBSD.org>	- use a shared lock for reads - remove stale comment Reviewed by: jeffr
# 885868cd	10-Apr-2009	Robert Watson <rwatson@FreeBSD.org>	Remove VOP_LEASE and supporting functions. This hasn't been used since the removal of NQNFS, but was left in in case it was required for NFSv4. Since our new NFSv4 client and server can't use it for their requirements, GC the old mechanism, as well as other unused lease-related code and interfaces. Due to its impact on kernel programming and binary interfaces, this change should not be MFC'd. Proposed by: jeff Reviewed by: jeff Discussed with: rmacklem, zach loafman @ isilon
# 33fc3625	11-Mar-2009	John Baldwin <jhb@FreeBSD.org>	Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a filesystem supports additional operations using shared vnode locks. Currently this is used to enable shared locks for open() and close() of read-only file descriptors. - When an ISOPEN namei() request is performed with LOCKSHARED, use a shared vnode lock for the leaf vnode only if the mount point has the extended shared flag set. - Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but not O_CREAT. - Use a shared vnode lock around VOP_CLOSE() if the file was opened with O_RDONLY and the mountpoint has the extended shared flag set. - Adjust md(4) to upgrade the vnode lock on the vnode it gets back from vn_open() since it now may only have a shared vnode lock. - Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since FIFO's require exclusive vnode locks for their open() and close() routines. (My recent MPSAFE patches for UDF and cd9660 already included this change.) - Enable extended shared operations on UFS, cd9660, and UDF. Submitted by: ups Reviewed by: pjd (ZFS bits) MFC after: 1 month
# e9aff357	21-Jan-2009	Konstantin Belousov <kib@FreeBSD.org>	Move the code from ufs_lookup.c used to do dotdot lookup, into the helper function. It is supposed to be useful for any filesystem that has to unlock dvp to walk to the ".." entry in lookup routine. Requested by: jhb Tested by: pho MFC after: 1 month
# 0c6a80e7	28-Nov-2008	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Improve KASSERT() call a bit: - Print flags in hex. - Note that flags can be fine and panic can be due to an unexpected error condition. - Remove redundant new line character. Even though the panic message exceeds 80 characters, keep it on one line so it is easier to grep.
# c5f77bf9	16-Nov-2008	Konstantin Belousov <kib@FreeBSD.org>	Revert r184118. There is actually code in the kernel, for instance in kern_unlinkat(), that expects that vn_start_write() actually fills the mp even when the call failed. As Tor noted, that pattern relies on the type stability of the mount points, as well as that suspended mount points are never freed and V_XSLEEP is always passed to vn_start_write() when called on a freed mount point. Reported by: stass Reviewed by: tegge PR: 123768
# 21fc02d2	03-Nov-2008	John Baldwin <jhb@FreeBSD.org>	Use shared vnode locks instead of exclusive vnode locks for the access(), chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek() (when figuring out the current size of the file in the SEEK_END case), pathconf(), readlink(), and statfs() system calls. Submitted by: ups (mostly) Tested by: pho MFC after: 1 month
# 83b3bdbc	02-Nov-2008	Attilio Rao <attilio@FreeBSD.org>	Improve VFS locking: - Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters. - Due to the change above, remove the mnt_lock lockmgr because it is now useless. - Due to the change above, vfs_busy() is no longer linked to a lockmgr. Change its KPI accordingly by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the old version (which was unlinked from the lockmgr already) and MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx once the mnt interlock is held (ability still desired by most consumers). - The stub used in vfs_mount_destroy(), that allows overriding the mnt_ref if running for more than 3 seconds, makes it totally useless. Remove it as it was thought to work in older versions. If a problem of "refcount held never going away" should appear, we will need to fix it properly rather than trust such a hackish solution. - Fix a bug where returning (with an error) from dounmount() was still leaving the MNTK_MWAIT flag on even if the waiters were actually woken up. Just a place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter. - Remove the markercnt refcount as it is useless. This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly. Discussed with: kib Tested by: pho
# 15bc6b2b	28-Oct-2008	Edward Tomasz Napierala <trasz@FreeBSD.org>	Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)
# 3ba28ace	21-Oct-2008	Konstantin Belousov <kib@FreeBSD.org>	Change vn_start_write() to clear *mpp on all failures when non-NULL vp is supplied, since vm_pageout_scan() expects it to be cleared on error. Submitted by: tegge PR: 123768 MFC after: 1 week
# 016f98f9	20-Oct-2008	Konstantin Belousov <kib@FreeBSD.org>	Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock and ffs_lock. This cannot catch situations where holdcnt is incremented not by curthread, but I think it is useful. Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 0fbbf2ea	20-Sep-2008	Konstantin Belousov <kib@FreeBSD.org>	Initialize va_rdev to NODEV and va_fsid to VNOVAL before the VOP_GETATTR() call in vn_stat(). Thus if a file system doesn't initialize those fields in VOP_GETATTR() they will have a sane default value. Submitted by: Jaakko Heinonen <jh saunalahti fi> Discussed on: freebsd-fs MFC after: 1 month
# ea60a5f5	20-Sep-2008	Konstantin Belousov <kib@FreeBSD.org>	Initialize birthtime fields in vn_stat() to prevent stat(2) from returning uninitialized birthtime. Most file systems don't initialize birthtime properly in their VOP_GETATTR(). Submitted by: Jaakko Heinonen <jh saunalahti fi> Reviewed by: bde Discussed on: freebsd-fs MFC after: 1 month
# 2814d5ba	16-Sep-2008	Konstantin Belousov <kib@FreeBSD.org>	When an attempt is made to suspend a filesystem that is already suspended, wait until the current suspension is lifted instead of silently returning success immediately. The consequences of calling vfs_write_resume() when not owning the suspension are not well-defined at best. Add the vfs_susp_clean() mount method to be called from vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and stop calling it manually. Add the thread flag TDP_IGNSUSP that allows bypassing the suspension point in vn_start_write(). It is intended for use by VFS in the situations where the suspender wants to do some i/o requiring calls to vn_start_write(), and this i/o cannot be done later. Reviewed by: tegge In collaboration with: pho MFC after: 1 month
# bdb80947	16-Sep-2008	Konstantin Belousov <kib@FreeBSD.org>	Garbage-collect vn_write_suspend_wait(). Suggested and reviewed by: tegge Tested by: pho MFC after: 1 month
# 0359a12e	28-Aug-2008	Attilio Rao <attilio@FreeBSD.org>	Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread was always curthread and totally useless. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
# 1d986c5f	03-Aug-2008	Robert Watson <rwatson@FreeBSD.org>	Remove broken code to replace st_mode value with ACCESSPERMS when lstat(2) is called on symlinks -- this code appears never to have worked. The PR this addresses suggests that the intended original behavior is the right one, but as bde points out in the PR comments, we do actually support storing a mode on symlinks, so returning it seems reasonable. This is consistent with Mac OS X, which despite documentation to the contrary does return the mode set on a symlink, but not some other platforms. The Single Unix Spec requires only that the returned bits be "meaningful", which seems at best unhelpful as advice goes. PR: 25018 MFC after: 3 days
# e314f69f	31-Mar-2008	Konstantin Belousov <kib@FreeBSD.org>	Add the support for the O_EXEC open(2) mode, as specified by the POSIX Extended API Set Part 2 extension specification. Reviewed by: rwatson, rdivacky Tested by: pho
# 5634d486	29-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Don't allow calls to vn_lock() with no lock type requested. Callers which simply want a reference should use vref(). Callers which want to check validity need to hold a lock while performing any action based on that validity. vn_lock() would always release the interlock before returning making any action synchronous with the validity check impossible.
# 804e60d4	23-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Don't acquire the vnode interlock in _vn_lock() unless no lock type is requested. Handle this case specially before the while loop. - Use the held vnode lock to check for VI_DOOMED. The vnode lock and interlock must both be held to set VI_DOOMED so either one held, even shared, is sufficient to check it. No objection by: kib
# 22db15c0	13-Jan-2008	Attilio Rao <attilio@FreeBSD.org>	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the useless extra argument and pass curthread explicitly to lower layer functions, when necessary. The KPI is broken by this change, which should affect several ports, so version bumping and manpage update will be committed later. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
# cb05b60a	09-Jan-2008	Attilio Rao <attilio@FreeBSD.org>	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modification makes the code cleaner and in particular removes an annoying dependence, helping the next lockmgr() cleanup. The KPI, obviously, changes as a result. Manpage and FreeBSD_version will be updated through further commits. As a side note, it is worth saying that the next commits will address a similar cleanup of VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
# e4650294	07-Jan-2008	John Baldwin <jhb@FreeBSD.org>	Make ftruncate a 'struct file' operation rather than a vnode operation. This makes it possible to support ftruncate() on non-vnode file types in the future. - 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on a given file descriptor. - ftruncate() moves to kern/sys_generic.c and now just fetches a file object and invokes fo_truncate(). - The vnode-specific portions of ftruncate() move to vn_truncate() in vfs_vnops.c which implements fo_truncate() for vnode file types. - Non-vnode file types return EINVAL in their fo_truncate() method. Submitted by: rwatson
# 92838485	05-Jan-2008	Bruce Evans <bde@FreeBSD.org>	In sequential_heuristic(): - spell 16384 as 16384 and not as BKVASIZE. 16384 is (not quite) just a magic size that works well in practice. BKVASIZE should be MAXBSIZE (65536), but is 16384 because i386's don't have enough kva for it to be MAXBSIZE; 16384 works (not so well) for it for much the same reasons that it works well in the heuristic. - expand and/or add comments about this and other details. - don't explicitly inline this function. - fix some other style bugs.
# 397c19d1	29-Dec-2007	Jeff Roberson <jeff@FreeBSD.org>	Remove explicit locking of struct file. - Introduce a finit() which is used to initialize the fields of struct file in such a way that the ops vector is only valid after the data, type, and flags are valid. - Protect f_flag and f_count with atomic operations. - Remove the global list of all files and associated accounting. - Rewrite the unp garbage collection such that it no longer requires the global list of all files and instead uses a list of all unp sockets. - Mark sockets in the accept queue so we don't incorrectly gc them. Tested by: kris, pho
# 30d239bc	24-Oct-2007	Robert Watson <rwatson@FreeBSD.org>	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 57fd3d55	26-Jul-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	When we do open, we should lock the vnode exclusively. This fixes a few races: - fifo race, where two threads assign v_fifoinfo, - v_writecount modifications, - v_object modifications, - and probably more... Discussed with: kib, ups Approved by: re (rwatson)
# 9e223287	31-May-2007	Konstantin Belousov <kib@FreeBSD.org>	Revert UF_OPENING workaround for CURRENT. Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation argument from being file descriptor index into the pointer to struct file. Proposed and reviewed by: jhb Reviewed by: daichi (unionfs) Approved by: re (kensmith)
# d413d210	18-May-2007	Konstantin Belousov <kib@FreeBSD.org>	Since renaming of vop_lock to _vop_lock, pre- and post-condition function calls are no more generated for vop_lock. Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption about vop naming conventions. This restores pre/post-condition calls.
# c6b342f8	17-May-2007	Peter Wemm <peter@FreeBSD.org>	Eliminate a micro-optimization that hasn't had any effect for 15+ years.
# 87aabdc1	12-Feb-2007	Mike Pritchard <mpp@FreeBSD.org>	Add a VNASSERT to vn_close to detect if v_writecount is going to become negative. This will detect the underflow when it happens, instead of having it discovered when the vnode is taken off the freelist, long after the offending process is long gone.
# 2f6a774b	12-Nov-2006	Kip Macy <kmacy@FreeBSD.org>	change vop_lock handling to allow tracking of callers' file and line for acquisition of lockmgr locks Approved by: scottl (standing in for mentor rwatson)
# acd3428b	06-Nov-2006	Robert Watson <rwatson@FreeBSD.org>	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
# aed55708	22-Oct-2006	Robert Watson <rwatson@FreeBSD.org>	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 92c08499	24-Jun-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Simplify the code and remove two mutex operations. MFC after: 2 weeks
# 6befa6ae	16-May-2006	Paul Saab <ps@FreeBSD.org>	Allow concurrent read(2)/readv(2) access to a file. Lock file offset against multiple read calls. Submitted by: ups Obtained from: Yahoo! MFC after: 2 weeks
# 122410ee	28-Apr-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	vn_start_write() is called only when v_type != VCHR, so corresponding vn_finished_write() should also be called only then. BTW. I fixed two functions here: vn_rdwr() and vn_write(). The latter seems to be unused. MFC after: 3 weeks
# 3bbd6d8a	30-Mar-2006	Jeff Roberson <jeff@FreeBSD.org>	- Release the references acquired by VOP_GETWRITEMOUNT and vfs_getvfs(). Discussed with: tegge Tested by: kris Sponsored by: Isilon Systems, Inc.
# 861dab08	28-Mar-2006	John Baldwin <jhb@FreeBSD.org>	Change vn_open() to honor the MPSAFE flag in the passed in nameidata object and use that instead of testing fdidx against -1 to determine if it should release Giant if Giant was locked due to the requested file residing on a non-MPSAFE VFS. Discussed with: jeff
# bacb51fb	21-Mar-2006	Jeff Roberson <jeff@FreeBSD.org>	- Remove explicit giant acquires and replace them with VFS_LOCK_GIANT. Sponsored by: Isilon Systems, Inc.
# a19fd0e7	11-Mar-2006	Christian S.J. Peron <csjp@FreeBSD.org>	Make sure that we are adding a path token to the audit record in open(2). Do this by making sure we are using the AUDITVNODE1 mask in the namei flags. Obtained from: TrustedBSD Project
# ca2fa807	10-Mar-2006	Tor Egge <tegge@FreeBSD.org>	Block secondary writes while expunging active unlinked files. Fix detection of active unlinked files by checking VI_OWEINACT and VI_DOINGINACT in addition to v_usecount. Defer inactive handling for unlinked files if the file system is mostly suspended (secondary writes being blocked). Perform deferred inactive handling after the file system is resumed.
# 791dd2fa	08-Mar-2006	Tor Egge <tegge@FreeBSD.org>	Use vn_start_secondary_write() and vn_finished_secondary_write() as a replacement for vn_write_suspend_wait() to better account for secondary write processing. Close race where secondary writes could be started after ffs_sync() returned but before the file system was marked as suspended. Detect if secondary writes or softdep processing occurred during vnode sync loop in ffs_sync() and retry the loop if needed.
# 0430a5e2	13-Dec-2005	Dag-Erling Smørgrav <des@FreeBSD.org>	Eradicate caddr_t from the VFS API.
# e8ddb61d	02-Aug-2005	Jeff Roberson <jeff@FreeBSD.org>	- Replace the series of DEBUG_LOCKS hacks which tried to save the vn_lock caller by saving the stack of the last locker/unlocker in lockmgr. We also put the stack in KTR at the moment. Contributed by: Antoine Brodin <antoine.brodin@laposte.net>
# dbb3ec5c	13-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- Remove vnode lock asserts at the end of vfs syscalls. These asserts were used to ensure that we weren't exiting the syscall with a lock still held. This wasn't safe, however, because we'd already executed a vput() and on a loaded system the vnode may have been free'd by the time we assert. This functionality is also handled by the td_locks assert in userret, which doesn't tell you what the syscall was, but will at least panic before you deadlock. Sponsored by: Isilon Systems, Inc. Discovered by: Peter Holm Approved by: re (blanket vfs)
# d598b04d	12-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- It has long been my suspicion that we don't actually need a loop in vn_lock(). Add an assert that will help me gain more confidence that this is correct. Sponsored by: Isilon Systems, Inc.
# 54981733	27-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Stop checking vxthread, we've asserted that it was useless for several weeks.
# 7625cbf3	27-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Pass the ISOPEN flag to namei so filesystems will know we're about to open them or otherwise access the data.
# 1b19c74d	11-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Assert that we're no longer doing recursive vn_locks in inactive/reclaim as I'd like to get rid of the vxthread. - Handle lock requests which don't actually want a lock as this is a much more convenient place to handle this condition than in vget(). These requests simply want to know that VI_DOOMED isn't set. - Correct a test at the end of vn_lock, if error != 0 should be if error == 0, this has been broken since I committed the VI_DOOMED changes, but no one ran into it because vget() duplicated this functionality. Sponsored by: Isilon Systems, Inc.
# f3e89267	04-Apr-2005	Christian S.J. Peron <csjp@FreeBSD.org>	Assert that the vnode is locked. This is meant to catch bugs or mis-use of the vnode API in conditions where IO_NODELOCKED has been used without the vnode actually being locked.
# f247a524	30-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- LK_NOPAUSE is a nop now. Sponsored by: Isilon Systems, Inc.
# 3e6bcad3	23-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Remove some long dead LOOKUP_SHARED code that tracked the lock state. - Always pass LOCKSHARED and rely on namei() to ignore it when LOOKUP_SHARED is not set. Sponsored by: Isilon Systems, Inc.
# 0463dc9e	13-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Do a vn_start_write in vn_close, we may write if this is the last ref on an unlinked file. We can't know if this is the case until after we have the lock. - Lock the vnode in vn_close, many filesystems had code which was unsafe without the lock held, and holding it greatly simplifies vgone(). - Adjust vn_lock() to check for the VI_DOOMED flag where appropriate. Sponsored by: Isilon Systems, Inc.
# cd138194	23-Feb-2005	Christian S.J. Peron <csjp@FreeBSD.org>	Add locking assertions into vn_extattr_set, vn_extattr_get and vn_extattr_rm. This is meant to catch conditions where IO_NODELOCKED has been specified without the vnode being locked. Discussed with: rwatson MFC after: 1 week
# 4d8ac58b	17-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce vx_wait{l}() and use it instead of home-rolled versions.
# dcff5b14	24-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Don't call VOP_CREATEVOBJECT(), it's the responsibility of the filesystem which owns the vnode.
# f50a2d5e	24-Jan-2005	Jeff Roberson <jeff@FreeBSD.org>	- Remove GIANT_REQUIRED where giant is no longer required. - Protect access to mnt_kern_flag with the mountpoint mutex. - Use the appropriate nd flags to deal with giant in vn_open_cred(). We currently determine whether the caller is mpsafe by checking for a valid fdidx. Any caller coming from user-space is now mpsafe and supplies a valid fd. No kernel callers have been converted to mpsafe, so this check is sufficient for now. - Use VFS_LOCK_GIANT instead of manual giant acquisition where appropriate. Sponsored By: Isilon Systems, Inc.
# e39db32a	12-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT() directly.
# 8df6bac4	11-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC(). I'm not sure why a credential was added to these in the first place, it is not used anywhere and it doesn't make much sense: The credentials for syncing a file (ability to write to the file) should be checked at the system call level. Credentials for syncing one or more filesystems ("none") should be checked at the system call level as well. If the filesystem implementation needs a particular credential to carry out the syncing it would logically have to be the cached mount credential, or a credential cached along with any delayed write data. Discussed with: rwatson
# 9454b2d8	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for copyright notices, minor format tweaks as necessary
# 18dc7373	18-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Ok, first blunder: ioctls are not entirely unused on vnodes anymore :-) Add dropped call to VOP_IOCTL().
# a0fbccc9	17-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Push Giant down through ioctl. Don't grab Giant in the upper syscall/wrapper code NET_LOCK_GIANT in the socket code (sockets/fifos). mtx_lock(&Giant) in the vnode code. mtx_lock(&Giant) in the opencrypto code. (This may actually not be needed, but better safe than sorry). Devfs grabs Giant if the driver is marked as needing Giant.
# db446e30	17-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Push Giant down through select and poll. Don't grab Giant in the upper syscall/wrapper code NET_LOCK_GIANT in the socket code (sockets/fifos). mtx_lock(&Giant) in the vnode code. Devfs grabs Giant if the driver is marked as needing Giant.
# f6083975	15-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Give vn_poll single exit point (to make it easier to insert "mtx_unlock(&Giant)" real soon now).
# c5b846fe	10-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Slim vnodes by another four bytes by eliminating the (now) unused field v_cachedid.
# b797084e	09-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove vnode->v_cachedfs. It was only used for the highly dangerous "export all vnodes with a sysctl" function.
# 5d9d81e7	26-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Put the I/O block size in bufobj->bo_bsize. We keep si_bsize_phys around for now as that is the simplest way to pull the number out of disk device drivers in devfs_open(). The correct solution would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot when filesystems sit on GEOM, so don't bother for now.
# 9b7cc97f	26-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unused si_bsize_best field from struct cdev.
# 6e8d4202	24-Sep-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Hold dev_lock and check for NULL devsw pointer when we service FIODTYPE ioctl.
# 90a660e1	21-Sep-2004	Poul-Henning Kamp <phk@FreeBSD.org>	If a vnode has no v_rdev we cannot hope to answer FIODTYPE ioctl.
# ad3b9257	15-Aug-2004	John-Mark Gurney <jmg@FreeBSD.org>	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowledge of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using its filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to acquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
# db532b63	06-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	Flag a broad range of VFS operations as GIANT_REQUIRED in order to catch leaking into VFS without Giant. Inch Giant a little lower in several file descriptor operations on vnodes to cover only VFS operations that need it, rather than file flag reading, etc.
# a6719c82	22-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Push Giant acquisition down into fo_stat() from most callers. Acquire Giant conditional on debug.mpsafenet in the socket soo_stat() routine, unconditionally in vn_statfile() for VFS, and otherwise don't acquire Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is a no-op. Don't acquire Giant in fstat() system call. Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS stack sitting on top, and therefore there will still be Giant recursion in this case.
# 1c1ce925	22-Jul-2004	Robert Watson <rwatson@FreeBSD.org>	Push acquisition of Giant from fdrop_closed() into fo_close() so that individual file object implementations can optionally acquire Giant if they require it: - soo_close(): depends on debug.mpsafenet - pipe_close(): Giant not acquired - kqueue_close(): Giant required - vn_close(): Giant required - cryptof_close(): Giant required (conservative) Notes: Giant is still acquired in close() even when closing MPSAFE objects due to kqueue requiring Giant in the calling closef() code. Microbenchmarks indicate that this removal of Giant cuts 3%-3% off of pipe create/destroy pairs from user space with SMP compiled into the kernel. The cryptodev and opencrypto code appears MPSAFE, but I'm unable to test it extensively and so have left Giant over fo_close(). It can probably be removed given some testing and review.
# 32240d08	10-Jul-2004	Marcel Moolenaar <marcel@FreeBSD.org>	Update for the KDB framework: o Call kdb_enter() instead of Debugger().
# f99619a0	04-Jun-2004	Tim J. Robbins <tjr@FreeBSD.org>	Change the types of vn_rdwr_inchunks()'s len and aresid arguments to size_t and size_t *, respectively. Update callers for the new interface. This is a better fix for overflows that occurred when dumping segments larger than 2GB to core files.
# f3d055b6	01-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Rather than assert f_type==DTYPE_VNODE, conditionally perform the file lock release based on f_type==DTYPE_VNODE. vn_closefile() is used by non-vnode types as well (fifo).
# 63732dce	01-Jun-2004	Robert Watson <rwatson@FreeBSD.org>	Push the VOP_ADVLOCK() call to release advisory locks on vnode file descriptors out of fdrop_locked() and into vn_closefile(). This removes all knowledge of vnodes from fdrop_locked(), since the lock behavior was specific to vnodes. This also removes the specific requirement for Giant in fdrop_locked(), it's now only required by code that it calls into. Add GIANT_REQUIRED to vn_closefile() since VFS requires Giant.
# e79962db	31-May-2004	Robert Watson <rwatson@FreeBSD.org>	Assert Giant in vn_start_write() and vn_finished_write().
# 7f8a436f	05-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 0249823e	12-Mar-2004	Bruce Evans <bde@FreeBSD.org>	Align the offset in vn_rdwr_inchunks() so that at most the first and the last chunk are misaligned relative to a MAXBSIZE byte boundary. vn_rdwr_inchunks() is used mainly for elf core dumps, and elf sections are usually perfectly misaligned relative to MAXBSIZE, and chunking prevents the file system from doing much realigning. This gives a surprisingly large speedup for core dumps -- from 50 to 13 seconds for a 512MB core dump here. The pessimization was mostly from an interaction of the misalignment with IO_DIRECT. It increased the number of i/o's for each chunk by a factor of 5 (3 writes and 2 read-before-writes instead of 1 write).
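The alignment described in 0249823e can be illustrated with a small userland sketch. This is not the kernel code: `next_chunk()`, `count_misaligned_starts()`, and the `bsize` parameter (standing in for MAXBSIZE) are hypothetical names chosen for illustration. The idea is that the first chunk is shortened so that every later chunk starts on a `bsize` boundary, leaving at most the first and last chunks misaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Size of the next chunk for a transfer of `len` bytes starting at
 * `offset`.  `bsize` is a hypothetical stand-in for MAXBSIZE.  The chunk
 * ends at the next bsize boundary, or earlier if fewer bytes remain.
 */
static size_t
next_chunk(uint64_t offset, size_t len, size_t bsize)
{
	size_t chunk;

	assert(bsize > 0);
	chunk = bsize - (size_t)(offset % bsize);
	if (chunk > len)
		chunk = len;
	return (chunk);
}

/*
 * Walk the whole transfer and count the chunks whose starting offset
 * is not a multiple of bsize.
 */
static unsigned
count_misaligned_starts(uint64_t offset, size_t len, size_t bsize)
{
	unsigned n = 0;

	while (len > 0) {
		size_t chunk = next_chunk(offset, len, bsize);

		if (offset % bsize != 0)
			n++;
		offset += chunk;
		len -= chunk;
	}
	return (n);
}
```

With a 64 KB `bsize`, a transfer starting at offset 1000 gets a short first chunk of 64536 bytes, after which every chunk starts on a boundary.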
# 9efe7d9d	28-Dec-2003	Bruce Evans <bde@FreeBSD.org>	v_vxproc was a bogus name for a thread (pointer).
# 1de1f935	04-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- If we are called with LK_NOWAIT in vn_lock() we may be holding a mutex and should not sleep while waiting for XLOCK to clear. Care needs to be taken in functions that use this capability to avoid spinning.
# 9080ff25	28-Jul-2003	Robert Watson <rwatson@FreeBSD.org>	Rename VOP_RMEXTATTR() to VOP_DELETEEXTATTR() for consistency with the kernel ACL interfaces and system call names. Break out UFS2 and FFS extattr delete and list vnode operations from setextattr and getextattr to deleteextattr and listextattr, which cleans up the implementations, and makes the results more readable, and makes the APIs more clear. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 3ab6b09c	27-Jul-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Pass the fdidx argument from vn_open{_cred}() onto VOP_OPEN()
# 7c89f162	27-Jul-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.
# a8d43c90	26-Jul-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Add a "int fd" argument to VOP_OPEN() which in the future will contain the filedescriptor number on opens from userland. The index is used rather than a "struct file *" since it conveys a bit more information, which may be useful in particular to fdescfs and /dev/fd/. For now pass -1 all over the place.
# 6b42f0a2	22-Jun-2003	Robert Watson <rwatson@FreeBSD.org>	Prefer the vop_rmextattr() vnode operation for removing extended attributes from objects over vop_setextattr() with a NULL uio; if the file system doesn't support the vop_rmextattr() method, fall back to the vop_setextattr() method. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 3b6d9652	22-Jun-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Add a f_vnode field to struct file. Several of the subtypes have an associated vnode which is used for stuff like the f*() functions. By giving the vnode a separate field, a number of checks for the specific subtype can be replaced simply with a check for f_vnode != NULL, and we can later free f_data up to subtype specific use. At this point in time, f_data still points to the vnode, so any code I might have overlooked will still work.
# 2db4b023	18-Jun-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce a new flag on a file descriptor: DFLAG_SEEKABLE and use that rather than assume that only DTYPE_VNODE is seekable.
# 7c2d2efd	18-Jun-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Initialize struct fileops with C99 sparse initialization.
# 677b542e	10-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# 0b955134	03-Jun-2003	Robert Watson <rwatson@FreeBSD.org>	Assert the vnode lock when returning successfully from vn_open_cred().
# 104a9b7e	29-Apr-2003	Alexander Kabaev <kan@FreeBSD.org>	Deprecate machine/limits.h in favor of new sys/limits.h. Change all in-tree consumers to include <sys/limits.h> Discussed on: standards@ Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
# 128a0bb7	26-Mar-2003	Tor Egge <tegge@FreeBSD.org>	fp->f_offset doesn't need any protection when it isn't accessed.
# e7d6662f	14-Feb-2003	Alfred Perlstein <alfred@FreeBSD.org>	Do not allow kqueues to be passed via unix domain sockets.
# 48e3128b	12-Jan-2003	Matthew Dillon <dillon@FreeBSD.org>	Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
# cd72f218	11-Jan-2003	Matthew Dillon <dillon@FreeBSD.org>	Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it. Change struct xfile xf_data to xun_data (ABI is still compatible). If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
# c579babe	07-Jan-2003	Brian Feldman <green@FreeBSD.org>	In vn_open(), unset ndp->ni_vp when returning failure so that code which expects it to be NULL unless the return value was 0 will work. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 45587e25	28-Dec-2002	Matthew Dillon <dillon@FreeBSD.org>	Abstract-out the constants for the sequential heuristic. No operational changes. MFC after: 1 day
# a7010ee2	24-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	White-space changes.
# f3a68211	23-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Detediousficate declaration of fileops array members by introducing typedefs for them.
# 9ab73fd1	24-Oct-2002	Kirk McKusick <mckusick@FreeBSD.org>	Within ufs, the ffs_sync and ffs_fsync functions did not always check for and/or report I/O errors. The result is that a VFS_SYNC or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in the presence of a hard error writing a disk sector or in a filesystem full condition. This patch ensures that I/O errors will always be checked and returned. This patch also ensures that every call to VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes appropriate action when an error is returned. Sponsored by: DARPA & NAI Labs.
# 89c61753	19-Oct-2002	Robert Watson <rwatson@FreeBSD.org>	Drop in the MAC check for file creation as part of open(). Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 3c275c19	26-Sep-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Under DIAGNOSTIC, complain if ENOIOCTL leaks out through VOP_IOCTL().
# 93b0017f	25-Aug-2002	Philippe Charnier <charnier@FreeBSD.org>	Replace various spelling with FALLTHROUGH which is lint()able
# ad32f726	22-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Fix a mistake in my last few commits. The PDROP flag stops msleep from re-acquiring the mutex. Pointy hat to: me Noticed by: tegge
# 4b6049ca	22-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Closer inspection revealed a possible deadlock situation in vn_lock() that was introduced by my last commit but not caught by stress testing. Fix that and slightly restructure the code so that it is more readable.
# 9abf54f0	22-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Make vn_lock() vget() and VOP_LOCK() all behave the same way WRT LK_INTERLOCK. The interlock will never be held on return from these functions even when there is an error. Errors typically only occur when the XLOCK is held which means this isn't the vnode we want anyway. Almost all users of these interfaces expected this behavior even though it was not provided before.
# 510939d0	22-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Return two shared locks to exclusive locks. This was premature. - Document the problems that prevent us from using shared locks.
# 6c54a1f5	22-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Fix interlock handling in vn_lock(). Previously, vn_lock() could return with interlock held in error conditions when the caller did not specify LK_INTERLOCK. - Add several comments to vn_lock() describing the rationale behind the code flow since it was not immediately obvious.
# 0b600db4	21-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Document two cases, one in vget and the other in vn_lock, where the state of interlock on exit is not consistent. There are probably several bugs relating to this.
# 177142e4	19-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Pass active_cred and file_cred into the MAC framework explicitly for mac_check_vnode_{poll,read,stat,write}(). Pass in fp->f_cred when calling these checks with a struct file available. Otherwise, pass NOCRED. All current MAC policies use active_cred, but could now offer the cached credential semantic used for the base system security model. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 7f724f8b	19-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Break out mac_check_vnode_op() into three separate checks: mac_check_vnode_poll(), mac_check_vnode_read(), mac_check_vnode_write(). This improves the consistency with other existing vnode checks, and allows policies to avoid implementing switch statements to determine what operations they do and do not want to authorize. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# d49fa1ca	16-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	In continuation of early fileop credential changes, modify fo_ioctl() to accept an 'active_cred' argument reflecting the credential of the thread initiating the ioctl operation. - Change fo_ioctl() to accept active_cred; change consumers of the fo_ioctl() interface to generally pass active_cred from td->td_ucred. - In fifofs, initialize filetmp.f_cred to ap->a_cred so that the invocations of soo_ioctl() are provided access to the calling f_cred. Pass ap->a_td->td_ucred as the active_cred, but note that this is required because we don't yet distinguish file_cred and active_cred in invoking VOP's. - Update kqueue_ioctl() for its new argument. - Update pipe_ioctl() for its new argument, pass active_cred rather than td_ucred to MAC for authorization. - Update soo_ioctl() for its new argument. - Update vn_ioctl() for its new argument, use active_cred rather than td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR(). Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# ea6027a8	15-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Make similar changes to fo_stat() and fo_poll() as made earlier to fo_read() and fo_write(): explicitly use the cred argument to fo_poll() as "active_cred" using the passed file descriptor's f_cred reference to provide access to the file credential. Add an active_cred argument to fo_stat() so that implementers have access to the active credential as well as the file credential. Generally modify callers of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which was redundantly provided via the fp argument. This set of modifications also permits threads to perform these operations on behalf of another thread without modifying their credential. Trickle this change down into fo_stat/poll() implementations: - badfo_poll(), badfo_stat(): modify/add arguments. - kqueue_poll(), kqueue_stat(): modify arguments. - pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to MAC checks rather than td->td_ucred. - soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather than cred to pru_sopoll() to maintain current semantics. - sopoll(): modify arguments. - vn_poll(), vn_statfile(): modify/add arguments, pass new arguments to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL() to maintain current semantics. - vn_close(): rename cred to file_cred to reflect reality while I'm here. - vn_stat(): Add active_cred and file_cred arguments to vn_stat() and consumers so that this distinction is maintained at the VFS as well as 'struct file' layer. Pass active_cred instead of td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics. - fifofs: modify the creation of a "filetemp" so that the file credential is properly initialized and can be used in the socket code if desired. Pass ap->a_td->td_ucred as the active credential to soo_poll(). If we teach the vnop interface about the distinction between file and active credentials, we would use the active credential here.
Note that current inconsistent passing of active_cred vs. file_cred to VOP's is maintained. It's not clear why GETATTR would be authorized using active_cred while POLL would be authorized using file_cred at the file system level. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 9ca43589	15-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	In order to better support flexible and extensible access control, make a series of modifications to the credential arguments relating to file read and write operations to clarify which credential is used for what: - Change fo_read() and fo_write() to accept "active_cred" instead of "cred", and change the semantics of consumers of fo_read() and fo_write() to pass the active credential of the thread requesting an operation rather than the cached file cred. The cached file cred is still available in fo_read() and fo_write() consumers via fp->f_cred. These changes largely in sys_generic.c. For each implementation of fo_read() and fo_write(), update cred usage to reflect this change and maintain current semantics: - badfo_readwrite() unchanged - kqueue_read/write() unchanged - pipe_read/write() now authorize MAC using active_cred rather than td->td_ucred - soo_read/write() unchanged - vn_read/write() now authorize MAC using active_cred but VOP_READ/WRITE() with fp->f_cred Modify vn_rdwr() to accept two credential arguments instead of a single credential: active_cred and file_cred. Use active_cred for MAC authorization, and select a credential for use in VOP_READ/WRITE() based on whether file_cred is NULL or not. If file_cred is provided, authorize the VOP using that cred, otherwise the active credential, matching current semantics. Modify current vn_rdwr() consumers to pass a file_cred if used in the context of a struct file, and to always pass active_cred. When vn_rdwr() is used without a file_cred, pass NOCRED. These changes should maintain current semantics for read/write, but avoid a redundant passing of fp->f_cred, as well as making it more clear what the origin of each credential is in file descriptor read/write operations.
Follow-up commits will make similar changes to other file descriptor operations, and modify the MAC framework to pass both credentials to MAC policy modules so they can implement either semantic for revocation. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
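The credential selection that 9ca43589 describes for vn_rdwr() can be sketched in isolation. This is a minimal illustration, not the kernel source: `struct ucred_sketch` and `select_vop_cred()` are hypothetical names, and the real `struct ucred` carries far more state. It shows only the stated rule that the VOP is authorized with file_cred when one is supplied and with active_cred otherwise.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct ucred. */
struct ucred_sketch {
	int cr_uid_sketch;
};

/*
 * Pick the credential used for the VOP_READ/VOP_WRITE call: the cached
 * file credential if the caller supplied one, otherwise the active
 * credential of the requesting thread.  Per the commit message, MAC
 * authorization uses active_cred regardless; that part is not modelled.
 */
static const struct ucred_sketch *
select_vop_cred(const struct ucred_sketch *active_cred,
    const struct ucred_sketch *file_cred)
{
	return (file_cred != NULL ? file_cred : active_cred);
}
```

A caller with no struct file in hand would pass a null file credential (NOCRED in the commit message) and so fall back to the active credential.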
# 0231c03d	12-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Implement IO_NOMACCHECK in vn_rdwr() -- perform MAC checks (assuming 'options MAC') as long as IO_NOMACCHECK is not set in the IO flags. If IO_NOMACCHECK is set, bypass MAC checks in vn_rdwr(). This allows vn_rdwr() to be used as a utility function inside of file systems where MAC checks have already been performed, or where the operation is being done on behalf of the kernel not the user. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 92e35b60	07-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Due to layering problems, remove the MAC checks from vn_rdwr() -- this VOP wrapper is called from within file systems so can result in odd loopback effects when MAC enforcement is used with the active (as opposed to saved) credential. These checks will be moved elsewhere. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# e6e370a7	04-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
# ee0812f3	01-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Since we have the struct file data pointer cached in vp, use that instead when invoking VOP_POLL().
# 4a58340e	01-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Invoke appropriate MAC framework entry points to authorize a number of vnode operations, including read, write, stat, poll. This permits MAC policies to revoke access to files following label changes, and to limit information spread about the file to user processes. Note: currently the file cached credential is used for some of these authorization checks. We will need to expand some of the MAC entry point APIs to permit multiple creds to be passed to the access control check to allow diverse policy behavior. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 37bde6c0	01-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Restructure the vn_open_cred() access control checks to invoke the MAC entry point for open authorization. Note that MAC can reject open requests where existing DAC code skips the open authorization check due to O_CREAT. However, the failure mode here is the same as other failure modes following creation, wherein an empty file may be left behind. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 4eee8de7	30-Jul-2002	Dag-Erling Smørgrav <des@FreeBSD.org>	Introduce struct xvnode, which will be used instead of struct vnode for sysctl purposes. Also add two fields to struct vnode, v_cachedfs and v_cachedid, which hold the vnode's device and file id and are filled in by vn_open_cred() and vn_stat(). Sponsored by: DARPA, NAI Labs
# 0b1040cb	21-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Set VAPPEND in open mode when O_APPEND is specified as an argument to open() or fhopen(). Currently this has no actual effect due to the treatment of VAPPEND in vaccess() and vaccess_acl() as a subset of VWRITE, but when MAC comes in, MAC will distinguish the two. Note: if any file systems are cutting their own permission models, they may wish to now take this into account. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# faab4e27	16-Jul-2002	Kirk McKusick <mckusick@FreeBSD.org>	Change the name of st_createtime to st_birthtime. This change is made to reduce confusion between st_ctime and st_createtime. Submitted by: Eric Allman <eric@sendmail.org> Sponsored by: DARPA & NAI Labs.
# 7f05b035	28-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.
# 5c71bc6c	28-Jun-2002	Jeff Roberson <jeff@FreeBSD.org>	Clean up vn_rdwr locking. - Do shared locks on read. - Only do vn_{start,finished}_write when writing.
# d374764f	24-Jun-2002	Kirk McKusick <mckusick@FreeBSD.org>	Use proper size in bzero of stat structure. Submitted by: Jake Burkholder <jake@locore.ca> Sponsored by: DARPA & NAI Labs.
# 6524dddc	22-Jun-2002	Kirk McKusick <mckusick@FreeBSD.org>	This patch fixes a size problem with the stat structure for 64-bit architectures that was introduced in the UFS2 code merge two days ago. The stat structure change that caused the problem was the addition of the file create time. Submitted by: Bruce Evans <bde@zeta.org.au> Sponsored by: DARPA & NAI Labs.
# 1c85e6a3	21-Jun-2002	Kirk McKusick <mckusick@FreeBSD.org>	This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
# 0e2d6cc8	14-May-2002	Jeff Roberson <jeff@FreeBSD.org>	Disable the shared locking namei() code for now. It breaks several stacking filesystems. This is on hold until the rest of VFS Locking is reviewed and deemed safe. It can be enabled with 'options LOOKUP_SHARED'.
# ba626c1d	16-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Lock proctree_lock instead of pgrpsess_lock.
# 79a3e970	14-Apr-2002	Jeff Roberson <jeff@FreeBSD.org>	Use VOP_GETVOBJECT instead of accessing the member directly. This fixed an issue with nullfs and NAMEI shared. Submitted by: Alexander Kabaev
# a59f8b9e	08-Apr-2002	Jeff Roberson <jeff@FreeBSD.org>	Turn #ifdef LOOKUP_SHARED into #ifndef LOOKUP_EXCLUSIVE to enable this behavior by default. Also, change the options line to reflect this. If there are no problems reported this will become the only behavior and the knob will be removed in a month or so. Demanded by: obrien
# 44731cab	01-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
# 237e41fc	25-Mar-2002	Bruce Evans <bde@FreeBSD.org>	Added used include of <sys/sx.h>. Don't depend on namespace pollution in <sys/file.h>.
# 4d77a549	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# 628abf6c	15-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Giant pushdown for read/write/pread/pwrite syscalls. kern/kern_descrip.c: Acquire Giant in fdrop_locked when file refcount hits zero, this removes the requirement for the caller to own Giant for the most part. kern/kern_ktrace.c: Acquire Giant in ktrgenio, simplifies locking in upper read/write syscalls. kern/vfs_bio.c: Acquire Giant in bwillwrite if needed. kern/sys_generic.c: Giant pushdown, remove Giant for: read, pread, write and pwrite. readv and writev aren't done yet because of the possible malloc calls for iov to uio processing. kern/sys_socket.c: Grab Giant in the socket fo_read/write functions. kern/vfs_vnops.c: Grab Giant in the vnode fo_read/write functions.
# 8de00f4a	11-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	This patch adds the "LOCKSHARED" option to namei which causes it to only acquire shared locks on leafs. The stat() and open() calls have been changed to make use of this new functionality. Using shared locks in these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes. Also, this reduces the number of exclusive locks that are floating around in the system, which helps reduce the number of deadlocks that occur. A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so this patch can be turned on for testing, and should eventually go away once it is proven to be stable. I have personally been running this patch for over a year now, so it is believed to be fully stable. Reviewed by: jake, obrien Approved by: jake
# 183ccde6	11-Mar-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Stop abusing the pgrpsess_lock.
# eb8e6d52	05-Mar-2002	Eivind Eklund <eivind@FreeBSD.org>	Document all functions, global and static variables, and sysctls. Includes some minor whitespace changes, and re-ordering to be able to document properly (e.g, grouping of variables and the SYSCTL macro calls for them, where the documentation has been added.) Reviewed by: phk (but all errors are mine)
# a854ed98	27-Feb-2002	John Baldwin <jhb@FreeBSD.org>	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
# f591779b	23-Feb-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock struct pgrp, session and sigio. New locks are: - pgrpsess_lock which locks the whole pgrps and sessions, - pg_mtx which protects the pgrp members, and - s_mtx which protects the session members. Please refer to sys/proc.h for the coverage of these locks. Changes on the pgrp/session interface: - pgfind() needs the pgrpsess_lock held. - The caller of enterpgrp() is responsible to allocate a new pgrp and session. - Call enterthispgrp() in order to enter an existing pgrp. - pgsignal() requires a pgrp lock held. Reviewed by: jhb, alfred Tested on: cvsup.jp.FreeBSD.org (which is a quad-CPU machine running -current)
# ec20f901	19-Feb-2002	Robert Watson <rwatson@FreeBSD.org>	More cleanups relating to vm object allocation failure: make sure we call VOP_CLOSE() with vp unlocked; clean up the return path a little, in as much as our namei/vnode operation return paths can be cleared up. For a return case that was apparently never taken, this sure is ugly. Reviewed by: jeffr
# b01bcf4c	17-Feb-2002	Ian Dowse <iedowse@FreeBSD.org>	Add the braces missed by revision 1.131. Pointy hat to: rwatson
# 4729fbd8	17-Feb-2002	Robert Watson <rwatson@FreeBSD.org>	When vn_open() is failing because it cannot allocate a vm object, call VOP_CLOSE() on the vnode, so that VOP_OPEN() and VOP_CLOSE() calls are symmetric in all failure cases. This prevents an 'open' reference from being leaked in that unlikely failure scenario.
# 1ea030d8	10-Feb-2002	Robert Watson <rwatson@FreeBSD.org>	Make sure to hold vnode lock when calling into VOP_GETATTR(). Discussed with: mckusick, phk
# 74237f55	09-Feb-2002	Robert Watson <rwatson@FreeBSD.org>	Part I: Update extended attribute API and ABI: o Modify the system call syntax for extattr_{get,set}_{fd,file}() so as not to use the scatter gather API (which appeared not to be used by any consumers, and be less portable), rather, accepts 'data' and 'nbytes' in the style of other simple read/write interfaces. This changes the API and ABI. o Modify system call semantics so that extattr_get_{fd,file}() return a size_t. When performing a read, the number of bytes read will be returned, unless the data pointer is NULL, in which case the number of bytes of data are returned. This changes the API only. o Modify the VOP_GETEXTATTR() vnode operation to accept a size_t * argument so as to return the size, if desirable. If set to NULL, the size will not be returned. o Update various filesystems (pseudofs, ufs) to DTRT. These changes should make extended attributes more useful and more portable. More commits to rebuild the system call files, as well as update userland utilities to follow. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 3e728224	25-Jan-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Make st_blksize default to PAGE_SIZE instead of zero.
# c73df808	18-Jan-2002	Matthew Dillon <dillon@FreeBSD.org>	Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal operation. The vgonel() code has always called vclean() but until we started proactively freeing vnodes it would never actually be called with a dirty vnode, so this situation did not occur prior to the vnlru() code. Now that we proactively free vnodes when kern.maxvnodes is hit, however, vclean() winds up with work to do and improperly generates the warnings. Reviewed by: peter Approved by: re (for MFC) MFC after: 1 day
# 426da3bc	13-Jan-2002	Alfred Perlstein <alfred@FreeBSD.org>	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file * fhold(struct file fp); / increments reference count on a file / struct file fhold_locked(struct file fp); / like fhold but expects file to locked / struct file ffind_hold(struct thread , int fd); / finds the struct file in thread, adds one reference and returns it unlocked / struct file ffind_lock(struct thread , int fd); / ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.
# fdb33f08	18-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	This is a forward port of Peter's vlrureclaim() fix, with some minor mods by me to make it more efficient. The original code had serious balancing problems and could also deadlock easily. This code relegates the vnode reclamation to its own kproc and relaxes the vnode reclamation requirements to better maintain kern.maxvnodes. This code still doesn't balance as well as it could, but it does a much better job than the original code. Approved by: re@freebsd.org Obtained from: ps, peter, dillon MFS Assuming: Assuming no problems crop up in Yahoo testing MFC after: 7 days
# f03e89de	11-Nov-2001	Alfred Perlstein <alfred@FreeBSD.org>	Turn vn_open() into a wrapper around vn_open_cred(), which allows one to perform a vn_open using temporary/other/fake credentials. Modify the nfs client side locking code to use vn_open_cred() passing proc0's ucred instead of the old way, which was to temporarily raise privs while running vn_open(). This should hopefully close the race.
# fc2749a4	23-Oct-2001	Robert Watson <rwatson@FreeBSD.org>	o vn_open() fails to call VOP_CLOSE() if vfs_object_create fails. Ideally all successful calls to VOP_OPEN() might be reflected in a call to VOP_CLOSE(). For now, simply add a comment reflecting this problem; this should be fixed at some point.
# 7106ca0d	11-Oct-2001	John Baldwin <jhb@FreeBSD.org>	Add missing includes of sys/lock.h.
# 3418ebeb	26-Sep-2001	Matthew Dillon <dillon@FreeBSD.org>	Make uio_yield() a global. Call uio_yield() between chunks in vn_rdwr_inchunks(), allowing other processes to gain an exclusive lock on the vnode. Specifically: directory scanning, to avoid a race to the root directory, and multiple child processes coring simultaneously so they can figure out that some other core'ing child has an exclusive adv lock and just exit instead. This completely fixes performance problems when large programs core. You can have hundreds of copies (forked children) of the same binary core all at once and not notice. MFC after: 3 days
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 06ae1e91	08-Sep-2001	Matthew Dillon <dillon@FreeBSD.org>	This brings in a Yahoo coredump patch from Paul, with additional mods by me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that if you have large process images core dumping, or you have a large number of forked processes all core dumping at the same time, the original coredump code would leave the vnode locked throughout. This can cause the directory vnode to get locked up, which can cause the parent directory vnode to get locked up, and so on all the way to the root node, locking the entire machine up for extremely long periods of time. This patch solves the problem in two ways. First it uses an advisory non-blocking lock to abort multiple processes trying to core to the same file. Second (my contribution) it chunks up the writes and uses bwillwrite() to avoid holding the vnode locked while blocking in the buffer cache. Submitted by: ps Reviewed by: dillon MFC after: 2 weeks
# 5d97bedb	23-Aug-2001	Andrey A. Chernov <ache@FreeBSD.org>	vn_stat(): if va_size (u_quad_t) > OFF_MAX, return EOVERFLOW, don't copy it blindly to st_size
# ac8f990b	24-May-2001	Matthew Dillon <dillon@FreeBSD.org>	This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece. Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency. I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet. Submitted by: tegge, dillon
# 60fb0ce3	28-Apr-2001	Greg Lehey <grog@FreeBSD.org>	Revert consequences of changes to mount.h, part 2. Requested by: bde
# 112f7372	25-Apr-2001	Kirk McKusick <mckusick@FreeBSD.org>	When closing the last reference to an unlinked file, it is freed by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended. For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written. Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.
# d98dc34f	23-Apr-2001	Greg Lehey <grog@FreeBSD.org>	Correct #includes to work with fixed sys/mount.h.
# 602ef631	25-Mar-2001	Boris Popov <bp@FreeBSD.org>	Previous commit broke interlock locking for !LK_RETRY case.
# 71d8277b	25-Mar-2001	Boris Popov <bp@FreeBSD.org>	Prevent race condition by using msleep() instead of mtx_unlock()/tsleep(). Reviewed by: alfred
# 30632071	18-Mar-2001	Robert Watson <rwatson@FreeBSD.org>	o Rename "namespace" argument to "attrnamespace" as namespace is a C++ reserved word. Submitted by: jkh Obtained from: TrustedBSD Project
# 70f36851	14-Mar-2001	Robert Watson <rwatson@FreeBSD.org>	o Change the API and ABI of the Extended Attribute kernel interfaces to introduce a new argument, "namespace", rather than relying on a first- character namespace indicator. This is in line with more recent thinking on EA interfaces on various mailing lists, including the posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces are defined by default, EXTATTR_NAMESPACE_SYSTEM and EXTATTR_NAMESPACE_USER, where the primary distinction lies in the access control model: user EAs are accessible based on the normal MAC and DAC file/directory protections, and system attributes are limited to kernel-originated or appropriately privileged userland requests. o These API changes occur at several levels: the namespace argument is introduced in the extattr_{get,set}_file() system call interfaces, at the vnode operation level in the vop_{get,set}extattr() interfaces, and in the UFS extended attribute implementation. Changes are also introduced in the VFS extattrctl() interface (system call, VFS, and UFS implementation), where the arguments are modified to include a namespace field, as well as modified to avoid direct access to userspace variables from below the VFS layer (in the style of recent changes to mount by adrian@FreeBSD.org). This required some cleanup and bug fixing regarding VFS locks and the VFS interface, as a vnode pointer may now be optionally submitted to the VFS_EXTATTRCTL() call. Updated documentation for the VFS interface will be committed shortly. o In the near future, the auto-starting feature will be updated to search two sub-directories to the ".attribute" directory in appropriate file systems: "user" and "system" to locate attributes intended for those namespaces, as the single filename is no longer sufficient to indicate what namespace the attribute is intended for. 
Until this is committed, all attributes auto-started by UFS will be placed in the EXTATTR_NAMESPACE_SYSTEM namespace. o The default POSIX.1e attribute names for ACLs and Capabilities have been updated to no longer include the '$' in their filename. As such, if you're using these features, you'll need to rename the attribute backing files to the same names without '$' symbols in front. o Note that these changes will require changes in userland, which will be committed shortly. These include modifications to the extended attribute utilities, as well as to libutil for new namespace string conversion routines. Once the matching userland changes are committed, a buildworld is recommended to update all the necessary include files and verify that the kernel and userland environments are in sync. Note: If you do not use extended attributes (most people won't), upgrading is not imperative although since the system call API has changed, the new userland extended attribute code will no longer compile with old include files. o Couple of minor cleanups while I'm there: make more code compilation conditional on FFS_EXTATTR, which should recover a bit of space on kernels running without EA's, as well as update copyright dates. Obtained from: TrustedBSD Project
# 608a3ce6	15-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Extend kqueue down to the device layer. Backwards compatible approach suggested by: peter
# 9ed346ba	08-Feb-2001	Bosko Milekic <bmilekic@FreeBSD.org>	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarly, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. 
Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# 1b367556	23-Jan-2001	Jason Evans <jasone@FreeBSD.org>	Convert all simplelocks to mutexes and remove the simplelock implementations.
# 936524aa	18-Nov-2000	Matthew Dillon <dillon@FreeBSD.org>	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather than block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmetic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. 
The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather than continue to corrupt the filesystem. The problem does not occur very often. It is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
# 1d7e3e42	02-Nov-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Take VBLK devices further out of their misery. This should fix the panic I introduced in my previous commit on this topic.
# 35e0e5b3	20-Oct-2000	John Baldwin <jhb@FreeBSD.org>	Catch up to moving headers: - machine/ipl.h -> sys/ipl.h - machine/mutex.h -> sys/mutex.h
# a18b1f1d	03-Oct-2000	Jason Evans <jasone@FreeBSD.org>	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
# 100d2c18	22-Sep-2000	Robert Watson <rwatson@FreeBSD.org>	o Introduce vn_extattr_rm(), a helper function in the style of vn_extattr_get() and vn_extattr_set(). vn_extattr_rm() removes the specified extended attribute from a vnode, authorizing the change as the kernel (NULL cred). Obtained from: TrustedBSD Project
# e81c5f43	04-Sep-2000	Robert Watson <rwatson@FreeBSD.org>	o vn_extattr_set() will now call appropriate vn_start_write() and vn_finished_write() if IO_NODELOCKED is not set. Obtained from: TrustedBSD Project
# e6a9ab52	08-Aug-2000	Robert Watson <rwatson@FreeBSD.org>	o Introduce vn_extattr_{get,set}, wrapper routines for VOP_GETEXTATTR and VOP_SETEXTATTR to simplify calling from in-kernel consumers, such as capability code. Both accept a vnode (optionally locked, with ioflg to indicate that), attribute name, and a buffer + buffer length in UIO_SYSSPACE. Both authorize the call as a kernel request, with cred set to NULL for the actual VOP_ calls. Obtained from: TrustedBSD Project
# 9b971133	23-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	This patch corrects the first round of panics and hangs reported with the new snapshot code. Update addaliasu to correctly implement the semantics of the old checkalias function. When a device vnode first comes into existence, check to see if an anonymous vnode for the same device was created at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than creating a new vnode for the device. This corrects a problem which caused the kernel to panic when taking a snapshot of the root filesystem. Change the calling convention of vn_write_suspend_wait() to be the same as vn_start_write(). Split out softdep_flushworklist() from softdep_flushfiles() so that it can be used to clear the work queue when suspending filesystem operations. Access to buffers becomes recursive so that snapshots can recursively traverse their indirect blocks using ffs_copyonwrite() when checking for the need for copy on write when flushing one of their own indirect blocks. This eliminates a deadlock between the syncer daemon and a process taking a snapshot. Ensure that softdep_process_worklist() can never block because of a snapshot being taken. This eliminates a problem with buffer starvation. Cleanup change in ffs_sync() which did not synchronously wait when MNT_WAIT was specified. The result was an unclean filesystem panic when doing forcible unmount with heavy filesystem I/O in progress. Return a zero'ed block when reading a block that was not in use at the time that a snapshot was taken. Normally, these blocks should never be read. However, the readahead code will occasionally read them which can cause unexpected behavior. Clean up the debugging code that ensures that no blocks be written on a filesystem while it is suspended. Snapshots must explicitly label the blocks that they are writing during the suspension so that they do not cause a `write on suspended filesystem' panic. 
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to prevent a race condition that would permit the same block to be copied twice. This change eliminates an unexpected soft updates inconsistency in fsck caused by the double allocation. Use bqrelse rather than brelse for buffers that will be needed soon again by the snapshot code. This improves snapshot performance.
# f2a2857b	11-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timeliness are not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
# e6796b67	03-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	Move the truncation code out of vn_open and into the open system call after the acquisition of any advisory locks. This fix corrects a case in which a process tries to open a file with a non-blocking exclusive lock. Even if it fails to get the lock it would still truncate the file even though its open failed. With this change, the truncation is done only after the lock is successfully acquired. Obtained from: BSD/OS
# cb5ad9d3	25-Jun-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Fix stupid braino in last commit, initialize `vp' before we test vp->v_tag. Spotted by: dillon
# c8bea19e	22-Jun-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Add a hack to fail registration of kq events on a non-ufs filesystem, as support for those is non-existent at the moment.
# e3975643	25-May-2000	Jake Burkholder <jake@FreeBSD.org>	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 740a1973	23-May-2000	Jake Burkholder <jake@FreeBSD.org>	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# 37d90a44	12-May-2000	Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>	Fix comment typo. Submitted by: nrahlstr
# 9626b608	05-May-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bde's teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.h> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter
# 2c9b67a8	30-Apr-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unneeded #include <vm/vm_zone.h> Generated by: src/tools/tools/kerninclude
# cb679c38	16-Apr-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce kqueue() and kevent(), a kernel event notification facility.
# e4649cfa	01-Apr-2000	Matthew Dillon <dillon@FreeBSD.org>	Change the write-behind code to take more care when starting async I/O's. The sequential read heuristic has been extended to cover writes as well. We continue to call cluster_write() normally, thus blocks in the file will still be reallocated for large (but still random) I/O's, but I/O will only be initiated for truly sequential writes. This solves a number of annoying situations, especially with DBM (hash method) writes, and also has the side effect of fixing a number of (stupid) benchmarks. Reviewed-by: mckusick
# ba4ad1fc	09-Jan-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Give vn_isdisk() a second argument where it can return a suitable errno. Suggested by: bde
# bd5f5da9	09-Jan-2000	Kirk McKusick <mckusick@FreeBSD.org>	Add bwillwrite to all system calls that create things in the filesystem. Benchmarks that create huge trees of empty files overwhelm the buffer cache.
# 762e6b85	15-Dec-1999	Eivind Eklund <eivind@FreeBSD.org>	Introduce NDFREE (and remove VOP_ABORTOP)
# 91921bd5	18-Nov-1999	Matthew Dillon <dillon@FreeBSD.org>	Ensure that garbage from the kernel stack does not wind up being returned to user mode in the spare fields of the stat structure. PR: kern/14966 Reviewed by: dillon@freebsd.org Submitted by: Kelly Yancey kbyanc@posi.net
# b127fae4	07-Nov-1999	Peter Wemm <peter@FreeBSD.org>	Add a vnode fo_stat() entry point.
# 13ccadd4	19-Sep-1999	Brian Feldman <green@FreeBSD.org>	This is what was "fdfix2.patch," a fix for fd sharing. It's pretty far-reaching in fd-land, so you'll want to consult the code for changes. The biggest change is that now, you don't use fp->f_ops->fo_foo(fp, bar) but instead fo_foo(fp, bar), which increments and decrements the fp refcount upon entry and exit. Two new calls, fhold() and fdrop(), are provided. Each does what it seems like it should, and if fdrop() brings the refcount to zero, the fd is freed as well. Thanks to peter ("to hell with it, it looks ok to me.") for his review. Thanks to msmith for keeping me from putting locks everywhere :) Reviewed by: peter
# 85a219d2	09-Sep-1999	Julian Elischer <julian@FreeBSD.org>	Changes to centralise the default blocksize behaviour. More likely to follow. Submitted by: phk@freebsd.org
# 7012bab9	02-Sep-1999	Julian Elischer <julian@FreeBSD.org>	Revert a bunch of controversial changes by PHK. After a quick think and discussion among various people some form of some of these changes will probably be recommitted. The reversion was requested by dg while discussions proceed. PHK has indicated that he can live with this, and it has been agreed that some form of some of these changes may return shortly after further discussion.
# de5f40af	31-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Improve the returned values in st_blksize a little bit, avoid accessing union fields not valid for dev_t type.
# 02e15769	30-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Make bdev userland access work like cdev userland access unless the highly non-recommended option ALLOW_BDEV_ACCESS is used. (bdev access is evil because you don't get write errors reported.) Kill si_bsize_best before it kills Matt :-) Use the specfs routines rather having cloned copies in devfs.
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# b5fca1cb	27-Aug-1999	Brian Feldman <green@FreeBSD.org>	Add FIODTYPE ioctl for getting d_flags (type) info on a device. Okayed by: phk
# a431597b	25-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Add a couple of missing but unimportant break; statements.
# 0232a251	13-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	oops: Add missing include.
# 3a965c0d	13-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Move the special-casing of stat(2)->st_blksize for device files from UFS to the generic level. For chr/blk devices we don't care about the blocksize of the filesystem, we want what the device asked for.
# e32c66c5	04-Aug-1999	Brian Feldman <green@FreeBSD.org>	Fix fd race conditions (during shared fd table usage.) Badfileops is now used in f_ops in place of NULL, and modifications to the files are more carefully ordered. f_ops should also be set to &badfileops upon "close" of a file. This does not fix other problems mentioned in this PR than the first one. PR: 11629 Reviewed by: peter
# 67452993	26-Jul-1999	Alan Cox <alc@FreeBSD.org>	Add sysctl and support code to allow directories to be VMIO'd. The default setting for the sysctl is OFF, which is the historical operation. Submitted by: dillon
# ad8ac923	08-Jul-1999	Kirk McKusick <mckusick@FreeBSD.org>	These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bringing the machine to a grinding halt. * buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems. * minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propagate to deeper inlines and should produce better code. * removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely. * removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal. * buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c * VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system. * removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers. * slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers. * addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation. * Removal of the newbufcnt and lastnewbuf counters that Kirk added. 
They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ). * removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ). * removal of extra arguments passed to getnewbuf() that are not necessary. * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c * vn_write() now calls bwillwrite() PRIOR to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently. * removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon. Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 8947a90a	02-Jul-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Make sure that stat(2) and friends always return a valid st_dev field. Pseudo-FS need not fill in the va_fsid anymore, the syscall code will use the first half of the fsid, which now looks like a udev_t with major 255.
# 75c13541	28-Apr-1999	Poul-Henning Kamp <phk@FreeBSD.org>	This Implements the mumbled about "Jail" feature. This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do. For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers". Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname. Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors. It generally does what one would expect, but setting up a jail still takes a little knowledge. A few notes: I have no scripts for setting up a jail, don't ask me for them. The IP number should be an alias on one of the interfaces. mount a /proc in each jail, it will make ps more useable. /proc/<pid>/status tells the hostname of the prison for jailed processes. Quotas are only sensible if you have a mountpoint per prison. There are no provisions for stopping resource-hogging. Some "#ifdef INET" and similar may be missing (send patches!) If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome! Tools, comments, patches & documentation most welcome. Have fun... Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/
# f711d546	27-Apr-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Suser() simplification: 1: s/suser/suser_xxx/ 2: Add new function: suser(struct proc *), prototyped in <sys/proc.h>. 3: s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/ The remaining suser_xxx() calls will be scrutinized and dealt with later. There may be some unneeded #include <sys/cred.h>, but they are left as an exercise for Bruce. More changes to the suser() API will come along with the "jail" code.
# f78fd73f	20-Apr-1999	Alan Cox <alc@FreeBSD.org>	Address several problems in vn_read and vn_write: 1. Make read-ahead work for pread and aio_read. 2. Fix one place where a comparison of uio_offset with -1 wasn't updated to use FOF_OFFSET. 3. Honor O_APPEND in the FOF_OFFSET case. In addition, use the variable name "ioflag" in both vn_read and vn_write to avoid possible confusion between the variable "flag" and the parameter "flags". Submitted by: Bruce Evans <bde@zeta.org.au> and me
# 8fe387ab	04-Apr-1999	Dmitrij Tejblum <dt@FreeBSD.org>	Add standard padding argument to pread and pwrite syscall. That should make them NetBSD compatible. Add parameter to fo_read and fo_write. (The only flag, FOF_OFFSET, means that the offset is set in the struct uio). Factor out some common code from read/pread/write/pwrite syscalls.
# cde9bc87	26-Mar-1999	Alan Cox <alc@FreeBSD.org>	Changed vn_read/write such that fp->f_offset isn't touched if uio->uio_offset != -1. This fixes a problem with aio_read/write and permits a straightforward implementation of pread/pwrite. PR: kern/8669 Submitted by: John Plevyak <jplevyak@inktomi.com> Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
# 57c90d6f	29-Jan-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Use suser() to determine super-user-ness, don't examine cr_uid directly.
# 15a1057c	20-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	Add 'options DEBUG_LOCKS', which stores extra information in struct lock, and add some macros and function parameters to make sure that the information gets to the point where it can be put in the lock structure. While I'm here, add DEBUG_VFS_LOCKS to LINT.
# fb116777	05-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	Remove the 'waslocked' parameter to vfs_object_create().
# f3d6ee09	01-Nov-1998	Peter Wemm <peter@FreeBSD.org>	Only do one VOP_ACCESS() per open() instead of two. This should reduce the NFSv3 ACCESS RPC problems a little for busy clients that do a lot of open/close. The nfs code could probably cache the results, but I'm not sure whether this would be legal or useful. The problem is that with a CPU farm, on each open there would be a lookup, getattr then access RPC then the read/write RPC activity. Caching the access results probably isn't going to help much if the clients access lots of files. Having the nfs_access() routine interpret the getattr results is a bit of a hack, but it's how NFSv2 is done and it might be OK for a mount attribute for v3.
# c259b8dd	27-Jun-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Report the mode as the result of the VOP_GETATTR rather than the vnodes type, they may not correspond.
# ecbb00a2	07-Jun-1998	Doug Rabson <dfr@FreeBSD.org>	This commit fixes various 64bit portability problems required for FreeBSD/alpha. The most significant item is to change the command argument to ioctl functions from int to u_long. This change brings us inline with various other BSD versions. Driver writers may like to use (__FreeBSD_version == 300003) to detect this change. The prototype FreeBSD/alpha machdep will follow in a couple of days time.
# 7be2d300	06-May-1998	Mike Smith <msmith@FreeBSD.org>	In the words of the submitter: --------- Make callers of namei() responsible for releasing references or locks instead of having the underlying filesystems do it. This eliminates redundancy in all terminal filesystems and makes it possible for stacked transport layers such as umapfs or nullfs to operate correctly. Quality testing was done with testvn, and lat_fs from the lmbench suite. Some NFS client testing courtesy of Patrik Kudo. vop_mknod and vop_symlink still release the returned vpp. vop_rename still releases 4 vnode arguments before it returns. These remaining cases will be corrected in the next set of patches. --------- Submitted by: Michael Hancock <michaelh@cet.co.jp>
# 7c2e3d32	09-Apr-1998	Alexander Langer <alex@FreeBSD.org>	Grammar police.
# 5ddc8ded	08-Apr-1998	Wolfram Schneider <wosch@FreeBSD.org>	New mount option nosymfollow. If enabled, the kernel lookup() function will not follow symbolic links on the mounted file system and return EACCES (Permission denied).
# 100ceca2	06-Apr-1998	Peter Wemm <peter@FreeBSD.org>	Today is not my lucky day. Fix missing brace and I got a request to use EMLINK instead.
# 193afe01	06-Apr-1998	Peter Wemm <peter@FreeBSD.org>	Use a different errno (ELOOP (as sef mentioned) since the text that goes with the error sounds ok for the condition) if O_NOFOLLOW gets a link.
# 0fdc628b	06-Apr-1998	Peter Wemm <peter@FreeBSD.org>	Rather than let users get fd's to symlink files, make O_NOFOLLOW cause an error if it gets a link (like it does if it gets a socket). The implications of letting users try and do file operations on symlinks themselves were too worrying.
# 7e3426aa	06-Apr-1998	Peter Wemm <peter@FreeBSD.org>	Implement a new open(2) flag: O_NOFOLLOW. This will instruct open to not follow symlinks, but to open a handle on the link itself(!). As strange as this might sound, it has several useful applications: safe race-free ways of opening files in hostile areas (eg: /tmp, a mode 1777 /var/mail, etc). It also would allow things like fchown() to work on the link rather than having to implement a new syscall specifically for that task. Reviewed by: phk
# 6b16931c	24-Feb-1998	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# 0b08f5f7	05-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Back out DIAGNOSTIC changes.
# 47cfdb16	04-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Turn DIAGNOSTIC into a new-style option.
# 925a3a41	11-Jan-1998	John Dyson <dyson@FreeBSD.org>	Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock. PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.
# 95e5e988	05-Jan-1998	John Dyson <dyson@FreeBSD.org>	Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does. When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex. Vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore. A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes. Please be careful with your system while running this code, but I would greatly appreciate feedback as soon as reasonably possible.
# 60f8d464	28-Dec-1997	John Dyson <dyson@FreeBSD.org>	Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem with the object cache removal.
# 2be70f79	28-Dec-1997	John Dyson <dyson@FreeBSD.org>	Lots of improvements, including restructuring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include: 1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync. Be gentle, and please give me feedback asap.
# 2a024a2b	05-Dec-1997	Sean Eric Fagan <sef@FreeBSD.org>	Changes to allow event-based process monitoring and control.
# fd3bf775	28-Nov-1997	John Dyson <dyson@FreeBSD.org>	Fix and complete the AIO syscalls. There are some performance enhancements coming up soon, but the code is functional. Docs will be forthcoming.
# 4a11ca4e	07-Nov-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused
# 32e4d4c5	27-Oct-1997	Bruce Evans <bde@FreeBSD.org>	Use 127 instead of CHAR_MAX for the limit on the sequence count. The limit doesn't have anything to do with characters. The count mainly needs to fit in the VOP_READ() ioflag after being left shifted by 16. Moved vn_lock() before vn_closefile(). vn_lock() was mismerged from Lite2. Removed some gratuitous braces.
# e7b0208f	05-Oct-1997	John Dyson <dyson@FreeBSD.org>	Relax the vnode locking for read only operations.
# a2f9bc72	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	vn_select -> vn_poll
# e4ba6a82	02-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# 42146e37	04-Apr-1997	Doug Rabson <dfr@FreeBSD.org>	[Previous comment was incorrect for these files] Added calls to VFS lock debugging macros to make fixing filesystems' locking easier.
# de15ef6a	04-Apr-1997	Doug Rabson <dfr@FreeBSD.org>	Add a function vop_sharedlock which is a copy of vop_nolock without the implementation #ifdef out. This can be used for now by NFS. As soon as all the other filesystems' locking is fixed, this can go away. Print the vnode address in vprint for easier debugging.
# 20982410	24-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Don't include <sys/ioctl.h> in the kernel. Stage 4: include <sys/ttycom.h> and sometimes <sys/filio.h> instead of <sys/ioctl.h> in miscellaneous files. Most of these files have nothing to do with ttys but need to include <sys/ttycom.h> to get the definitions of TIOC[SG]PGRP which are (ab)used to convert F[SG]ETOWN fcntls into ioctls.
# 3ac4d1ef	22-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.
# dfd0621a	08-Mar-1997	Guido van Rooij <guido@FreeBSD.org>	Fix style bugs and other bugs in the NFS fix.
# 324d42ad	07-Mar-1997	Gary Palmer <gpalmer@FreeBSD.org>	Fix (I hope) the NFS hole. This is only compile tested. Submitted by: (partly) davids@SECNET.COM via BUGTRAQ
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 996c772f	09-Feb-1997	John Dyson <dyson@FreeBSD.org>	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 8b612c4b	28-Dec-1996	John Dyson <dyson@FreeBSD.org>	This commit is the embodiment of some VFS read clustering improvements. Firstly, now our read-ahead clustering is on a file descriptor basis and not on a per-vnode basis. This will allow multiple processes reading the same file to take advantage of read-ahead clustering. Secondly, there previously was a problem with large reads still using the ramp-up algorithm. Of course, that was bogus, and now we read the entire "chunk" off of the disk in one operation. The read-ahead clustering algorithm should use less CPU than the previous also (I hope :-)). NOTE: THAT LKMS MUST BE REBUILT!!!
# 6476c0d2	21-Aug-1996	John Dyson <dyson@FreeBSD.org>	Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistent and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices is simpler and more correct. This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc. Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility. Reviewed by: davidg
# edbfedac	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change. [note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]
# 0f20dc94	08-Mar-1996	John Dyson <dyson@FreeBSD.org>	Remove a now unnecessary function prototype.
# 91477adc	01-Mar-1996	John Dyson <dyson@FreeBSD.org>	Enable VMIO for non-VDIR metadata and block device.
# bd7e5f99	18-Jan-1996	John Dyson <dyson@FreeBSD.org>	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do a lot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30 seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
# 27a0b398	17-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Staticize. Unstaticize a function in scsi/scsi_base that was used, with an undocumented option. My last count on the LINT kernel shows: Total symbols: 3647 unref symbols: 463 undef symbols: 4 1 ref symbols: 1751 2 ref symbols: 485 Approaching the pain threshold now.
# a316d390	10-Dec-1995	John Dyson <dyson@FreeBSD.org>	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
# efeaf95a	06-Dec-1995	David Greenman <dg@FreeBSD.org>	Untangled the vm.h include file spaghetti.
# d68a4190	22-Oct-1995	David Greenman <dg@FreeBSD.org>	Moved the filesystem read-only check out of the syscalls and into the filesystem layer, as was done in lite-2. Merged in some other cosmetic changes while I was at it. Rewrote most of msdosfs_access() to be more like ufs_access() and to include the FS read-only check. Obtained from: partially from 4.4BSD-lite2
# e83e1865	06-Oct-1995	Poul-Henning Kamp <phk@FreeBSD.org>	A little hack to avoid a 64bit divide. Can go away if Gcc ever learns to optimise 64bit stuff...
# 24aa09cd	20-Jul-1995	David Greenman <dg@FreeBSD.org>	vnode_pager_alloc() never returns NULL, so don't check for it.
# 97e15667	16-Jul-1995	Bruce Evans <bde@FreeBSD.org>	Don't include <sys/tty.h> in drivers that aren't tty drivers or in general files that don't depend on the internals of <sys/tty.h>
# 24a1cce3	13-Jul-1995	David Greenman <dg@FreeBSD.org>	NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!! Much needed overhaul of the VM system. Included in this first round of changes: 1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers". 2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has essentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items. 3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed. 4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug. 5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. 
The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance. 6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain. 7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance. 8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed. 9) Some almost useless debugging code removed. 10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure essentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology. 11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended. 12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course). 13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non-standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. 
Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE. 14) (dyson) Sharing maps have been removed. Its marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintainability. (As were most all of these changes) TODO: 1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size. 2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness. 3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind. 4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems. 5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
# 06cb7259	09-Jul-1995	David Greenman <dg@FreeBSD.org>	Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places that call vnode_pager_alloc() so that a failure return can be dealt with. This fixes a panic seen on NFS clients when a file being opened is deleted on the server before the open completes.
# 6663c3d5	27-Jun-1995	David Greenman <dg@FreeBSD.org>	Removed extra semicolon.
# aa2cabb9	27-Jun-1995	David Greenman <dg@FreeBSD.org>	1) Converted v_vmdata to v_object. 2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs after vnode_pager_alloc() calls - the object is already guaranteed to be persistent. 3) Removed some gratuitous casts.
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# 54ea0c00	10-May-1995	David Greenman <dg@FreeBSD.org>	Unlock the vnode before sleeping on an OBJ_DEAD object. Should fix Bruce's hang. Fixed some formatting anomalies and removed some unneeded casts.
# 50475e8b	18-Mar-1995	David Greenman <dg@FreeBSD.org>	Removed unnecessary call to vnode_pager_uncache(). We automatically clear the VTEXT flag after all mappers have finished with the object.
# 9977034c	13-Feb-1995	Poul-Henning Kamp <phk@FreeBSD.org>	YFfix
# 0d94caff	09-Jan-1995	David Greenman <dg@FreeBSD.org>	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a separate submap for temporary exec string space and another one to contain process upages. 
This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
# 8e58bf68	05-Oct-1994	David Greenman <dg@FreeBSD.org>	Stuff object into v_vmdata rather than pager. Not important which at the moment, but will be in the future. Other changes mostly cosmetic, but are made for future VMIO considerations. Submitted by: John Dyson
# 797f2d22	02-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.
# 605f11c8	17-Aug-1994	David Greenman <dg@FreeBSD.org>	Moved over my fix for vnode lossage when multiple TIOCSCTTY ioctls are done. This patch was extended to also include a suggested change by Kirk McKusick which allows the control tty to be reassigned to a different tty without losing a vnode.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources