History log of /freebsd-11.0-release/sys/kern/vfs_subr.c
Revision	Date	Author	Comments
# 303975	11-Aug-2016	gjb	Copy stable/11@r303970 to releng/11.0 as part of the 11.0-RELEASE cycle. Prune svn:mergeinfo from the new branch, and rename it to RC1. Update __FreeBSD_version. Use the quarterly branch for the default FreeBSD.conf pkg(8) repo and the dvd1.iso packages population. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation /freebsd-11.0-release/etc/pkg/FreeBSD.conf /freebsd-11.0-release/release/pkg_repos/release-dvd.conf /freebsd-11.0-release/sys/conf/newvers.sh /freebsd-11.0-release/sys/sys/param.h
# 303290	25-Jul-2016	kib	MFC r302567: In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called, if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate(). MFC r302580: Fix grammar. Approved by: re (gjb)
# 302408	08-Jul-2016	gjb	Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle. Prune svn:mergeinfo from the new branch, as nothing has been merged here. Additional commits post-branch will follow. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
# 302322	03-Jul-2016	kib	Remove racy assert. The thread which changes vnode usecount from 0 to 1 does it under the vnode interlock, but the interlock is not owned by the asserting thread. As a result, we might read the increased use counter but still see VI_OWEINACT. In collaboration with: nwhitehorn Hardware donated by: IBM LTC Sponsored by: The FreeBSD Foundation (kib) Approved by: re (gjb)
# 302029	20-Jun-2016	kib	Fix typo. Note that atomic is still required even for interlocked case. Sponsored by: The FreeBSD Foundation Approved by: re (marius)
# 302000	17-Jun-2016	mjg	vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels This removes calls to empty functions like vop_lock_{pre/post} from common vfs routines. Approved by: re (gjb)
# 301996	17-Jun-2016	kib	Add VFS interface to flush specified amount of free vnodes belonging to mount points with the given filesystem type, specified by mount vfs_ops pointer. Based on patch by: mckusick Reviewed by: avg, mckusick Tested by: allanjude, madpilot Sponsored by: The FreeBSD Foundation Approved by: re (gjb)
# 301040	31-May-2016	trasz	Cosmetics - add missing space after ellipses in shutdown messages. MFC after: 1 month Sponsored by: The FreeBSD Foundation
# 299916	16-May-2016	avg	vfs_read_dirent: increment ncookies after adding a cookie It seems that at present vfs_read_dirent() is used only with filesystems that do not support cookies, so the bug never manifested itself. MFC after: 1 week
# 298982	03-May-2016	kib	Add EVFILT_VNODE open, read and close notifications. While there, order EVFILT_VNODE notes descriptions alphabetically. Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru> MFC after: 2 weeks
# 298922	02-May-2016	kib	Issue NOTE_EXTEND when a directory entry is added to or removed from the monitored directory as the result of rename(2) operation. The renames staying in the directory are not reported. Submitted by: Vladimir Kondratyev <wulf@cicgroup.ru> MFC after: 2 weeks
# 298921	02-May-2016	kib	Fix reporting of NOTE_LINK when directory link count changes due to rename removing or adding subdirectory entry. Discussed with and tested by: Vladimir Kondratyev <wulf@cicgroup.ru> NetBSD PR: 48958 (http://gnats.netbsd.org/48958) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# 298819	29-Apr-2016	pfg	sys/kern: spelling fixes in comments. No functional change.
# 298649	26-Apr-2016	pfg	sys: extend use of the howmany() macro when available. We have a howmany() macro in the <sys/param.h> header that is convenient to re-use as it makes things easier to read.
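For readers unfamiliar with the macro, howmany(x, y) is a ceiling division: the number of y-sized units needed to cover x. The sketch below repeats the usual <sys/param.h> definition under a guard so that it also compiles off FreeBSD; blocks_needed() is a hypothetical helper used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the macro in <sys/param.h>; defined here only if absent. */
#ifndef howmany
#define	howmany(x, y)	(((x) + ((y) - 1)) / (y))
#endif

/* Hypothetical helper: number of fixed-size blocks needed to hold nbytes. */
static size_t
blocks_needed(size_t nbytes, size_t blksize)
{
	assert(blksize > 0);
	return (howmany(nbytes, blksize));
}
```

With a 512-byte block size, 513 bytes need two blocks and exactly 512 bytes need one.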
# 295971	24-Feb-2016	kib	Provide more correct sizing of the KVA consumed by a vnode, used by the virtvnodes calculation. Include the size of fs-specific v_data as the nfs nclnode inline, the NFS nclnode is bigger than either ZFS znode or UFS inode. Include the size of namecache_ts and short cache path element, multiplied by the name cache population factor, again inline. Inline defines are used to avoid pollution of the vnode.h with the subsystem-private objects. Non-significant unsynchronized changes of the definitions are fine, we do not care about that precision, and e.g. ZFS consumes much malloced memory per vnode for reasons unaccounted in the formula. Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10. The measures reduce vnode cache pressure on kmem and bring the vnode cache memory use below some apparent thresholds that were exceeded by r291244 due to more robust vnode reuse. Reported and tested by: marius (i386, previous version) Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 295716	17-Feb-2016	kib	In bnoreuselist(), check both ends of the specified logical block numbers range. This effectively skips indirect and extdata blocks on the buffer queue. Since their logical block numbers are negative, bnoreuselist() could loop infinitely. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 294299	18-Jan-2016	markj	Add vrefl(), a locked variant of vref(9). This API has no in-tree consumers at the moment but is useful to at least one out-of-tree consumer, and naturally complements existing vnode refcount functions (vholdl(9), vdropl(9)). Obtained from: kib (sys/ portion) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4947 Differential Revision: https://reviews.freebsd.org/D4953
# 293197	05-Jan-2016	kib	Two fixes for excessive iterations after r292326. Advance the logical block number to the lblkno of the found block plus one, instead of incrementing the block number which was used for lookup. This change skips sparsely populated buffer ranges, similar to r292325, instead of doing useless lookups. Do not restart the bnoreuselist() from the start of the range if buffer lock cannot be obtained without sleep. Only retry lookup and lock for the same queue and same logical block number. Reported by: benno Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 292326	16-Dec-2015	kib	Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a buffer for each block number in the range with gbincore(), look up the next instantiated buffer with the logical block number which is greater than or equal to the next lblkno. This significantly speeds up the iteration for sparsely populated ranges. Move the iteration into new helper bnoreuselist(), which is structured similarly to flushbuflist(). Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation
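The iteration strategy described in r292326, together with the follow-up fixes in r293197 and r295716, can be sketched in userland C. This is an illustration only: a sorted array stands in for the kernel's buffer trie, and lookup_ge() and visit_range() are hypothetical names, not the kernel's functions. The point is that each step asks for the next key that exists, then continues from that key plus one, and stops at the upper end of the range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the first key >= want in a sorted array, or n if there is none. */
static size_t
lookup_ge(const int64_t *keys, size_t n, int64_t want)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid] < want)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo);
}

/*
 * Visit every instantiated block in [start, end).  Returns the number of
 * blocks visited and stores the number of lookups performed in *lookups.
 */
static size_t
visit_range(const int64_t *keys, size_t n, int64_t start, int64_t end,
    size_t *lookups)
{
	size_t visited = 0;
	int64_t lblkno = start;

	assert(lookups != NULL);
	*lookups = 0;
	while (lblkno < end) {
		size_t i = lookup_ge(keys, n, lblkno);

		(*lookups)++;
		if (i == n || keys[i] >= end)
			break;		/* nothing left inside the range */
		visited++;
		lblkno = keys[i] + 1;	/* step past the block just found */
	}
	return (visited);
}
```

For the keys {-12, -1, 3, 4, 1000000} and the range [0, 2000000), the loop performs four lookups rather than two million probes, and the negative keys (standing in for indirect and extdata blocks) are never visited because both ends of the range are checked.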
# 292325	16-Dec-2015	kib	Simplify the loop step in the flushbuflist() and make it independent of the type stability of the buffers memory. Instead of memoizing pointer to the next buffer and validating it, remember the next logical block number in the bo list and re-lookup. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation
# 291743	04-Dec-2015	mckusick	We need to zero out the clustering variables in a freed vnode structure. For completeness add a VNASSERT that there are no threads waiting on a range lock (this was previously checked on every vnode free). Reported by: Rick Macklem Fix from: Mateusz Guzik PR: 204949
# 291671	03-Dec-2015	mckusick	We need to zero out the union of pointers in a freed vnode structure. PR: 204949 Fix from: Mateusz Guzik Tested by: Jason Unovitch
# 291460	29-Nov-2015	mckusick	As the kernel allocates and frees vnodes, it fully initializes them on every allocation and fully releases them on every free. These are not trivial costs: it starts by zeroing a large structure then initializes a mutex, a lock manager lock, an rw lock, four lists, and six pointers. And looking at vfs.vnodes_created, these operations are being done millions of times an hour on a busy machine. As a performance optimization, this code update uses the uma_init and uma_fini routines to do these initializations and cleanups only as the vnodes enter and leave the vnode_zone. With this change the initializations are only done kern.maxvnodes times at system startup and then only rarely again. The frees are done only if the vnode_zone shrinks which never happens in practice. For those curious about the avoided work, look at the vnode_init() and vnode_fini() functions in kern/vfs_subr.c to see the code that has been removed from the main vnode allocation/free path. Reviewed by: kib Tested by: Peter Holm
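The idea behind r291460 (pay the expensive setup once when an object enters the cache, not on every allocation) can be illustrated with a toy free-list cache. This is a userland sketch under stated assumptions, not the UMA API: struct zone, zone_alloc(), zone_free(), and the lock_ready field are all hypothetical, and the inits counter exists only to make the effect observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct item {
	int		lock_ready;	/* stands in for costly lock/list setup */
	int		payload;	/* per-use state, reset on each alloc */
	struct item	*next_free;
};

struct zone {
	struct item	*free;		/* cached, already initialized items */
	unsigned long	inits;		/* how often the costly setup ran */
};

static struct item *
zone_alloc(struct zone *z)
{
	struct item *it = z->free;

	if (it != NULL) {
		z->free = it->next_free;	/* reuse: no costly setup */
	} else {
		it = calloc(1, sizeof(*it));
		if (it == NULL)
			return (NULL);
		it->lock_ready = 1;		/* "init" hook: runs once */
		z->inits++;
	}
	it->payload = 0;			/* cheap per-allocation reset */
	return (it);
}

static void
zone_free(struct zone *z, struct item *it)
{
	assert(it != NULL && it->lock_ready);
	it->next_free = z->free;		/* keep it initialized for reuse */
	z->free = it;
}
```

Allocating, freeing, and allocating again returns the same item while inits stays at one, which is the saving the commit message describes for vnodes.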
# 291380	27-Nov-2015	kib	Remove VI_AGE vnode iflag, it is unused. Noted by: bde Sponsored by: The FreeBSD Foundation
# 291379	27-Nov-2015	kib	Move the comment about resident pages preventing vnode from leaving active list, into the header comment for vdrop(), which is the function that decides whether to leave the vnode on the list. Note that dirty page write-out in vinactive() is asynchronous. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 291244	24-Nov-2015	kib	Rework the vnode cache recycling to meet free and unused vnodes targets. See the comment above wantfreevnodes variable for the description of the algorithm. The vfs.vlru_alloc_cache_src sysctl is removed. New code frees namecache sources as the last chance to satisfy the highest watermark, instead of selecting the source vnodes randomly. This provides good enough behaviour to keep vn_fullpath() working in most situations. The filesystem layout with deep trees, where the removed knob was required, is thus handled automatically. Submitted by: bde Discussed with: mckusick Tested by: pho MFC after: 1 month
# 291116	21-Nov-2015	glebius	Remove remnants of the old NFS from vnode pager. Reviewed by: kib Sponsored by: Netflix
# 288280	26-Sep-2015	markj	Remove a check for a condition that is always false by a preceding KASSERT that was added in r144704.
# 288276	26-Sep-2015	markj	Fix argument ordering in vn_printf(). MFC after: 3 days
# 287831	15-Sep-2015	cem	kevent(2): Note DOOMED vnodes with NOTE_REVOKE In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at registration will never be woken by the RECLAIM trigger.) Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that were fine at registration but are vgoned while being monitored should signal waiters.) Reviewed by: kib Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3675
# 287497	06-Sep-2015	mckusick	Track changes to kern.maxvnodes and appropriately increase or decrease the size of the name cache hash table (mapping file names to vnodes) and the vnode hash table (mapping mount point and inode number to vnode). An appropriate locking strategy is the key to changing hash table sizes while they are in active use. Reviewed by: kib Tested by: Peter Holm Differential Revision: https://reviews.freebsd.org/D2265 MFC after: 2 weeks
# 287107	24-Aug-2015	trasz	Make vfs_unmountall() unmount /dev after /, not before. The only reason this didn't result in an unclean shutdown is that devfs ignores MNT_FORCE flag. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3467
# 287033	23-Aug-2015	trasz	After r286237 it should be fine to call vgone(9) on a busy GEOM vnode; remove KASSERT that would prevent forced devfs unmount from working. MFC after: 1 month Sponsored by: The FreeBSD Foundation
# 286307	05-Aug-2015	ed	Make it possible to implement poll(2) on top of kqueue(2). It looks like EVFILT_READ and EVFILT_WRITE trigger under the same conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX. The only difference is that POLLRDNORM has to be triggered on regular files unconditionally, whereas EVFILT_READ only triggers when not EOF. Introduce a new flag, NOTE_FILE_POLL, that can be used to make EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag will be used by cloudlibc's poll() function. Reviewed by: jmg Differential Revision: https://reviews.freebsd.org/D3303
# 286281	04-Aug-2015	trasz	Mark vgonel() as static. It was already declared static earlier; no idea why compilers don't warn about this. MFC after: 1 month Sponsored by: The FreeBSD Foundation
# 285632	16-Jul-2015	mjg	vfs: implement v_holdcnt/v_usecount manipulation using atomic ops Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free list) of either counter are still guarded with vnode interlock. Reviewed by: kib (earlier version) Tested by: pho
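The scheme in r285632 (lock-free counting except for the 0->1 and 1->0 transitions) can be sketched with C11 atomics and a pthread mutex standing in for the vnode interlock. Everything named here (struct obj, obj_ref(), obj_rele(), on_free_list) is hypothetical and simplified; it is not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	pthread_mutex_t	interlock;	/* guards 0->1 and 1->0 transitions */
	atomic_uint	usecount;
	bool		on_free_list;	/* protected by interlock */
};

static void
obj_init(struct obj *o)
{
	pthread_mutex_init(&o->interlock, NULL);
	atomic_init(&o->usecount, 0);
	o->on_free_list = true;
}

/* Lock-free increment that refuses to perform the 0->1 transition. */
static bool
acquire_if_not_zero(atomic_uint *cnt)
{
	unsigned int old = atomic_load(cnt);

	while (old != 0) {
		if (atomic_compare_exchange_weak(cnt, &old, old + 1))
			return (true);
	}
	return (false);
}

/* Lock-free decrement that refuses to perform the 1->0 transition. */
static bool
release_if_not_last(atomic_uint *cnt)
{
	unsigned int old = atomic_load(cnt);

	while (old > 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return (true);
	}
	return (false);
}

static void
obj_ref(struct obj *o)
{
	if (acquire_if_not_zero(&o->usecount))
		return;
	pthread_mutex_lock(&o->interlock);
	/* Still atomic: other threads may use the fast paths concurrently. */
	if (atomic_fetch_add(&o->usecount, 1) == 0)
		o->on_free_list = false;
	pthread_mutex_unlock(&o->interlock);
}

static void
obj_rele(struct obj *o)
{
	unsigned int old;

	if (release_if_not_last(&o->usecount))
		return;
	pthread_mutex_lock(&o->interlock);
	old = atomic_fetch_sub(&o->usecount, 1);
	assert(old > 0);
	if (old == 1)
		o->on_free_list = true;
	pthread_mutex_unlock(&o->interlock);
}
```

The slow paths keep using atomic operations while holding the lock, matching the note in r302029 that atomics are still required in the interlocked case.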
# 285393	11-Jul-2015	mjg	vfs: always clear VI_OWEINACT in consumers bumping v_usecount Previously vputx would detect the condition and clear the flag. With this change it is invalid to have both v_usecount > 0 and the flag set. Assert the condition is met in all relevant places. Reviewed by: kib
# 285392	11-Jul-2015	mjg	vfs: move si_usecount manipulation to dedicated functions Reviewed by: kib
# 285384	11-Jul-2015	kib	Do not allow creation of the dirty buffers for the dead buffer objects, i.e. for buffer objects whose vnode was reclaimed. Buffer cache cannot write such buffers. Return the error and discard the buffer immediately on write attempt. BO_DEAD is now always set during vnode reclamation, since it is used not only for the INVARIANTS checks. Do allow placement of the clean buffers on dead bufobj list, otherwise filesystems cannot use bufcache at all after the devvp reclaim. Reported and tested by: trasz Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 285183	05-Jul-2015	markj	Remove a stale descriptive comment for gbincore(). The splay trees referenced in the comment were converted to path-compressed tries in r250551. MFC after: 3 days
# 284733	23-Jun-2015	jmg	zero this struct as it depends upon it... Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D2890
# 284495	17-Jun-2015	kib	vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages. The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk not later than in 30 seconds after the write. Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 283602	27-May-2015	kib	Right now, dounmount() is called with unreferenced mount point. Nothing stops a parallel unmount from succeeding before the given call to dounmount() checks and locks the covered vnode. Prevent dounmount() from acting on the freed (although type-stable) memory by changing the interface to require the mount point to be referenced. dounmount() consumes the reference on return, regardless of the successful or erroneous result. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 281562	15-Apr-2015	rmacklem	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks
# 279362	27-Feb-2015	kib	The VNASSERT in vflush() FORCECLOSE case is trying to panic early to prevent errors from yanking devices out from under filesystems. Only care about special vnodes on devfs, special nodes on other kinds of filesystems do not have special properties. Sponsored by: EMC / Isilon Storage Division Submitted by: Conrad Meyer MFC after: 1 week
# 278891	17-Feb-2015	ngie	Add the mnt_lockref field to the ddb(4) 'show mount' command MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D1688 Submitted by: Conrad Meyer <conrad.meyer@isilon.com> Sponsored by: EMC / Isilon Storage Division
# 278760	14-Feb-2015	jhb	Add two new counters for vnode life cycle events: - vfs.recycles counts the number of vnodes forcefully recycled to avoid exceeding kern.maxvnodes. - vfs.vnodes_created counts the number of vnodes created by successful calls to getnewvnode(). Differential Revision: https://reviews.freebsd.org/D1671 Reviewed by: kib MFC after: 1 week
# 277712	25-Jan-2015	jhb	Change the default VFS timestamp precision from seconds to microseconds. Discussed on: arch@ MFC after: 2 weeks
# 275743	13-Dec-2014	kib	The vinactive() call in vgonel() may start writes for the dirty pages, creating delayed write buffers belonging to the reclaimed vnode. Put the buffer cleanup code after inactivation. Add asserts that ensure that buffer queues are empty and add BO_DEAD flag for bufobj to check that no buffers are added after the cleanup. BO_DEAD is only used by INVARIANTS-enabled kernels. Reported and tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 275637	09-Dec-2014	kib	Apply chunk forgotten in r275620. Remove local variable for real. CID: 1257462 Sponsored by: The FreeBSD Foundation
# 275620	08-Dec-2014	kib	Add functions syncer_suspend() and syncer_resume(), which are supposed to be called before suspension and after resume, correspondingly. The syncer_suspend() ensures that all filesystems dirty data and metadata are saved to the permanent storage, and stops kernel threads which might modify filesystems. The syncer_resume() restores stopped threads. For now, only syncer is stopped. This is needed, because each sync loop causes superblock updates for UFS. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 273118	15-Oct-2014	mjg	Don't take devmtx unnecessarily in vn_isdisk. MFC after: 1 week
# 272366	01-Oct-2014	will	In the syncer, drop the sync mutex while patting the watchdog. Some watchdog drivers (like ipmi) need to sleep while patting the watchdog. See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK). Submitted by: asomers MFC after: 1 month Sponsored by: Spectra Logic MFSpectraBSD: 637548 on 2012/10/04
# 269457	03-Aug-2014	kib	Remove Giant acquisition from the mount and unmount paths. It could be claimed that two things were reasonably protected by Giant. One is vfsconf list links, which is converted to the new dedicated sx vfsconf_sx. Another is vfsconf.vfc_refcount, which is now updated with atomics. Note that vfc_refcount still has the same races now as it had under Giant: the unload of filesystem modules can happen while the module is still in use. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 269244	29-Jul-2014	kib	Remove one-time use macros which check for the vnode lifecycle. Moreover, some parts of the checks are in fact redundant in the surrounding code, and it is clearer what the conditions are by direct testing of the flags. Two of the three macros were only used in assertions. In vnlru_free(), all relevant parts of vholdl() were already inlined, except the increment of v_holdcnt itself. Do not call vholdl() to do the increment as well; this allows the assertions in vholdl()/vhold() to be made stricter. In v_incr_usecount(), call vholdl() before incrementing other ref counters. The change is a no-op, but it makes the vnode state less surprising to see in the debugger if interrupted inside v_incr_usecount(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 267392	12-Jun-2014	mav	Implement simple direct-mapped cache for popular filesystem identifiers to avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while traversing through the list of mount points. This change significantly improves NFS server scalability, since it had to do this translation for every request, and the global lock becomes quite congested. This code is more optimized for relatively small number of mount points. On systems with hundreds of active mount points this simple cache may have many collisions. But the original traversal code in that case should also behave much worse, so we are not losing much. Reviewed by: attilio MFC after: 2 weeks Sponsored by: iXsystems, Inc.
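A direct-mapped cache in front of a list walk, as r267392 describes, can be sketched as follows. This is a single-threaded illustration with hypothetical names (struct mnt, cached_lookup(), CACHE_SIZE); it leaves out the locking, reference counting, and invalidation on unmount that real kernel code would need. The slow_lookups counter only shows when the fallback path runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	CACHE_SIZE	16	/* hypothetical; must be a power of two */
static_assert((CACHE_SIZE & (CACHE_SIZE - 1)) == 0, "power of two");

struct mnt {
	uint32_t	fsid;
	struct mnt	*next;
};

static struct mnt *cache[CACHE_SIZE];	/* one candidate per slot */
static unsigned long slow_lookups;	/* list walks performed */

/* Fallback: linear walk of the whole mount list. */
static struct mnt *
list_lookup(struct mnt *head, uint32_t fsid)
{
	slow_lookups++;
	for (struct mnt *m = head; m != NULL; m = m->next) {
		if (m->fsid == fsid)
			return (m);
	}
	return (NULL);
}

static struct mnt *
cached_lookup(struct mnt *head, uint32_t fsid)
{
	size_t slot = fsid & (CACHE_SIZE - 1);
	struct mnt *m = cache[slot];

	if (m != NULL && m->fsid == fsid)
		return (m);		/* hit: no list walk */
	m = list_lookup(head, fsid);
	if (m != NULL)
		cache[slot] = m;	/* replace whatever was in the slot */
	return (m);
}
```

Two identifiers that map to the same slot (for example 1 and 17 with 16 slots) evict each other, which is the collision cost the commit message accepts for systems with many mount points.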
# 267362	11-Jun-2014	mav	Remove unneeded mountlist_mtx acquisition from sync_fsync(). All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.
# 267239	08-Jun-2014	mav	Remove extra branching from r267232. MFC after: 2 weeks
# 267232	08-Jun-2014	mav	Use atomics to modify numvnodes variable. This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(), that reduces number of global vnode_free_list_mtx mutex acquisitions from 4 to 2 per NFS request on ZFS, improving SMP scalability. Reviewed by: kib MFC after: 2 weeks Sponsored by: iXsystems, Inc.
# 266482	21-May-2014	bjk	Check for mismatched vref()/vdrop() Assert that the hold count has not fallen below the use count, a situation that would only happen when a vref() (or similar) is erroneously paired with a vdrop(). This situation has not been observed in the wild, but could be helpful for someone implementing a new filesystem. Reviewed by: kib Approved by: hrs (mentor)
# 263620	22-Mar-2014	bdrewery	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcpu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division
# 256211	09-Oct-2013	kib	Do not flush buffers when the v_object of the passed vnode does not really belong to it. Such vnodes, with the pointers to other vnodes v_objects, are typically instantiated by the bypass filesystems. Invalidating mappings of other vnode pages and the pages is wrong, since reclamation of the upper vnode does not imply that lower vnode is reclaimed too. One of the consequences of the improper reclamation was destruction of the wired mappings of the lower vnode pages, triggering miscellaneous assertions in the VM system. Reported by: John Marshall <john.marshall@riverwillow.com.au> Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)
# 255979	01-Oct-2013	kib	When printing the vnode information from ddb, print the lengths of the dirty and clean buffer queues. Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)
# 255942	29-Sep-2013	kib	For vunref(), try to upgrade the vnode lock if the function was called with the vnode shared-locked. If upgrade succeeded, the inactivation can be done immediately, instead of being postponed. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)
# 255880	26-Sep-2013	kib	Acquire a hold reference on the vnode when a knote is instantiated. Otherwise, knote keeps a pointer to a vnode which could become invalid any time. Reported by: many Tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Discussed with: jmg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)
# 254446	17-Aug-2013	pjd	In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated. Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().
# 253737	28-Jul-2013	kib	When creation of the v_pollinfo raced and our instance of vpollinfo must be destroyed, knlist_clear() and seldrain() calls could be avoided, since vpollinfo was not used. More, the knlist_clear() calling protocol requires the knlist locked, which is not true at the call site. Split the destruction into the helper destroy_vpollinfo_free(), and call it when raced, instead of destroy_vpollinfo(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 253417	17-Jul-2013	kib	Clear the vnode knotes before destroying vpollinfo. Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 251322	03-Jun-2013	kib	Be more generous when donating the current thread time to the owner of the vnode lock while iterating over the free vnode list. Instead of yielding, pause for 1 tick. The change is reported to help in some virtualized environments. Submitted by: Roger Pau Monné <roger.pau@citrix.com> Discussed with: jilles Tested by: pho MFC after: 2 weeks
# 251171	31-May-2013	jeff	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf
# 250551	12-May-2013	jeff	- Add a new general purpose path-compressed radix trie which can be used with any structure containing a uint64_t index. The tree code auto-generates type safe wrappers. - Eliminate the buf splay and replace it with pctrie. This is not only significantly faster with large files but also allows for the possibility of shared locking. Reviewed by: alc, attilio Sponsored by: EMC / Isilon Storage Division
# 250505	11-May-2013	kib	- Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The null_hashget() obtains the reference on the nullfs vnode, which must be dropped. - Fix a wart which existed from the introduction of the nullfs caching, do not unlock lower vnode in the nullfs_reclaim_lowervp(). It should be innocent, but now it is also formally safe. Inform the nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on nullfs inode. - Add a callback to the upper filesystems for the lower vnode unlinking. When inactivating a nullfs vnode, check if the lower vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC on the lower vnode, and reclaim upper vnode if so. This allows nullfs to purge cached vnodes for the unlinked lower vnode, avoiding excessive caching. Reported by: Göran Löwkrantz <goran.lowkrantz@ismobile.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 250411	09-May-2013	marcel	Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE locks. To support this, VNODE locks are created with the LK_IS_VNODE flag. This flag is propagated down using the LO_IS_VNODE flag. Note that WITNESS still records the LOR. Only the printing and the optional entering into the kernel debugger is bypassed with the WITNESS_NO_VNODE option.
# 250247	04-May-2013	mdf	Add missing vdrop() in error case. Submitted by: Fahad (mohd.fahadullah@isilon.com) MFC after: 1 week
# 249548	16-Apr-2013	rmacklem	Allow the vnode to be unlocked for the weird case of LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a usecount on a vnode during NFSv4 recovery from an expired lease. Reported and tested by: pho MFC after: 2 weeks
# 249218	06-Apr-2013	jeff	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division
# 248084	09-Mar-2013	attilio	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but a few notes are reported: * The KPI changes as follows: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
# 245411	14-Jan-2013	kib	Add a trivial comment to record the proper commit log for r245407: Set the v_hash for a new vnode in the getnewvnode() to the value calculated based on the vnode structure address. Filesystems using vfs_hash_insert() override the v_hash using the standard formula of (inode_number + mnt_hashseed). For other filesystems, the initialization allows the vfs_hash_index() to provide useful hash too. Suggested, reviewed and tested by: peter Sponsored by: The FreeBSD Foundation MFC after: 5 days
# 245407	14-Jan-2013	kib	diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 7c243b6..0bdaf36 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
 #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)
 
+static int vnsz2log;
 
 /*
  * Initialize the vnode management data structures.
@@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 static void
 vntblinit(void *dummy __unused)
 {
+	u_int i;
 	int physvnodes, virtvnodes;
 
 	/*
@@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
 	syncer_maxdelay = syncer_mask + 1;
 	mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
 	cv_init(&sync_wakeup, "syncer");
+	for (i = 1; i <= sizeof(struct vnode); i <<= 1)
+		vnsz2log++;
+	vnsz2log--;
 }
 SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL);
 
@@ -1067,6 +1072,14 @@ alloc:
 	}
 	rangelock_init(&vp->v_rl);
 
+	/*
+	 * For the filesystems which do not use vfs_hash_insert(),
+	 * still initialize v_hash to have vfs_hash_index() useful.
+	 * E.g., nullfs uses vfs_hash_index() on the lower vnode for
+	 * its own hashing.
+	 */
+	vp->v_hash = (uintptr_t)vp >> vnsz2log;
+
 	*vpp = vp;
 	return (0);
 }
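The address-based hash introduced by r245407 can be tried out in userland. The sketch below reuses the loop from the diff to compute floor(log2(size)) and then shifts a pointer right by that amount; size_to_log2() and addr_hash() are hypothetical names chosen for this illustration, and the SIZE_MAX / 2 guard is an addition that keeps the shift from wrapping for very large sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor(log2(size)), computed with the same loop shape as the diff. */
static int
size_to_log2(size_t size)
{
	int lg = 0;

	assert(size > 0 && size <= SIZE_MAX / 2);
	for (size_t i = 1; i <= size; i <<= 1)
		lg++;
	return (lg - 1);
}

/*
 * Drop the low address bits, which carry little information for objects
 * of roughly 2^shift bytes.
 */
static uintptr_t
addr_hash(const void *p, int shift)
{
	assert(shift >= 0);
	return ((uintptr_t)p >> shift);
}
```

Two objects whose addresses are at least 2^shift bytes apart always receive different hash values, which is the property that matters for same-sized structures allocated one after another.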
# 244706	26-Dec-2012	attilio	Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1 case. There is no point in optimizing further the code and use a TRUE literal for a path that does heavyweight stuff anyway (like lock acq), at the price of obfuscated code. Use the appropriate check where necessary and remove a macro. Sponsored by: EMC / Isilon storage division MFC after: 3 days
# 244534	21-Dec-2012	attilio	Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There are no evidences that consumers could bear with such change in semantic and this situation could finally lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved. Tested by: pho Reviewed by: kib, mdf MFC after: 1 week
# 244240	15-Dec-2012	kib	When mnt_vnode_next_active iterator cannot lock the next vnode and yields, specify the user priority for the yield. Otherwise, a higher-priority (kernel) thread could fall into the priority-inversion with the thread owning the mutex lock. On single-processor machines or UP kernels, do not loop adaptively when the next vnode cannot be locked, instead yield unconditionally. Restructure the iteration initializer and the iterator to remove code duplication. Put the code to fetch and lock a vnode next to the current marker, into the mnt_vnode_next_active() function, and use it instead of repeating the loop. Reported by: hrs, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 244095	10-Dec-2012	kib	Do not yield while owning a mutex. The Giant reacquire in the kern_yield() is problematic then. The owned mutex is the mount interlock, and it is in fact not needed to guarantee the stability of the mount list of active vnodes, so fix the issue by only taking the mount interlock for MNT_REF and MNT_REL operations. While there, augment the unconditional yield by some amount of spinning [1]. Reported and tested by: pho Reviewed by: attilio Submitted by: attilio [1] MFC after: 3 days
# 243835	03-Dec-2012	kib	The vnode_free_list_mtx is required unconditionally when iterating over the active list. The mount interlock is not enough to guarantee the validity of the tailq link pointers. The __mnt_vnode_next_active() and __mnt_vnode_first_active() active lists iterators helper functions did not provide the necessary stability for the list, allowing the iterators to pick garbage. This was uncovered after the r243599 made the active list iterators non-nop. Since a vnode interlock is before the vnode_free_list_mtx, obtain the vnode ilock in the non-blocking manner when under vnode_free_list_mtx, and restart iteration after the yield if the lock attempt failed. Assert that a vnode found on the list is active, and assert that the helpers return the vnode with interlock owned. Reported and tested by: pho MFC after: 1 week
# 243599	27-Nov-2012	davidxu	Take first active vnode correctly. Reviewed by: kib MFC after: 3 days
# 243499	24-Nov-2012	avg	assert_vop_locked: make the assertion race-free and more efficient this is really a minor improvement for the sake of correctness MFC after: 6 days
# 243400	22-Nov-2012	avg	remove vop_lookup_pre and vop_lookup_post Suggested by: kib MFC after: 5 days
# 243307	19-Nov-2012	attilio	insmntque() is always called with the lock held in exclusive mode, then: - assume the lock is held in exclusive mode and remove a moot check about the lock acquisition. - in the destructor remove !MPSAFE specific chunk. Reviewed by: kib MFC after: 2 weeks
# 243272	19-Nov-2012	avg	assert_vop_locked should treat LK_EXCLOTHER as the not locked case ... from a perspective of the current thread. Spotted by: mjg Discussed with: kib MFC after: 18 days
# 243271	19-Nov-2012	avg	vnode_if: fix locking protocol description for lookup and cachedlookup Also remove the checks from vop_lookup_pre and vop_lookup_post, which are now completely redundant (before this change they were partially redundant). Discussed with: kib MFC after: 10 days
# 242833	09-Nov-2012	attilio	Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.
# 242617	05-Nov-2012	kib	A clarification to the behaviour of the active vnode list management regarding the vnode page cleaning. In collaboration with: pho MFC after: 1 week
# 242560	04-Nov-2012	kib	Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command. MFC after: 3 weeks
# 242559	04-Nov-2012	kib	Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command. MFC after: 3 days
# 242556	04-Nov-2012	kib	Order the enumeration of the MNT_ flags to be the same as the order of their definitions. MFC after: 3 days
# 241896	22-Oct-2012	kib	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
# 241556	14-Oct-2012	kib	Add a KPI to allow to reserve some amount of space in the numvnodes counter, without actually allocating the vnodes. The supposed use of the getnewvnode_reserve(9) is to reclaim enough free vnodes while the code still does not hold any resources that might be needed during the reclamation, and to consume the slack later for getnewvnode() calls made from the innards. After the critical block is finished, the caller shall free any reserve left, by getnewvnode_drop_reserve(9). Reviewed by: avg Tested by: pho MFC after: 1 week
# 240475	13-Sep-2012	attilio	Remove all the checks on curthread != NULL with the exception of some MD trap checks (eg. printtrap()). Generally this check is not needed anymore, as there is not a legitimate case where curthread == NULL, after pcpu 0 area has been properly initialized. Reviewed by: bde, jhb MFC after: 1 week
# 240284	09-Sep-2012	kib	Add a facility for vgone() to inform the set of subscribed mounts about vnode reclamation. Typical use is for the bypass mounts like nullfs to get a notification about lower vnode going away. Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument lowervp which is reclaimed. It is possible to register several reclamation event listeners, to correctly handle the case of several nullfs mounts over the same directory. For the filesystem not having nullfs mounts over it, the overhead added is a single mount interlock lock/unlock in the vnode reclamation path. In collaboration with: pho MFC after: 3 weeks
# 239588	22-Aug-2012	kib	Provide some compat32 shims for sysctl vfs.conflist. It is required for getvfsbyname(3) operation when called from 32bit process, and getvfsbyname(3) is used by recent bsdtar import. Reported by: many Tested by: David Naylor <naylor.b.david@gmail.com> MFC after: 5 days
# 236503	03-Jun-2012	avg	free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG Those calls are useful with hardware watchdog drivers too. MFC after: 3 weeks
# 236317	30-May-2012	kib	Add a rangelock implementation, intended to be used to range-locking the i/o regions of the vnode data space. The implementation is quite simple-minded, it uses the list of the lock requests, ordered by arrival time. Each request may be for read or for write. The implementation is fair FIFO. MFC after: 2 month
# 234607	23-Apr-2012	trasz	Remove unused thread argument to vrecycle(). Reviewed by: kib
# 234605	23-Apr-2012	trasz	Remove unused thread argument from vtruncbuf(). Reviewed by: kib
# 234483	20-Apr-2012	mckusick	This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point to replace MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync routines. The vfs_msync routine is run every 30 seconds for every writably mounted filesystem. It ensures that any files mmap'ed from the filesystem with modified pages have those pages queued to be written back to the file from which they are mapped. The ffs_lazy_sync and qsync routines are run every 30 seconds for every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine ensures that any files that have been accessed in the previous 30 seconds have had their access times queued for updating in the filesystem. The qsync routine ensures that any files with modified quotas have those quotas queued to be written back to their associated quota file. In a system configured with 250,000 vnodes, less than 1000 are typically active at any point in time. Prior to this change all 250,000 vnodes would be locked and inspected twice every minute by the syncer. For UFS/FFS filesystems they would be locked and inspected six times every minute (twice by each of these three routines since each of these routines does its own pass over the vnodes associated with a mount point). With this change the syncer now locks and inspects only the tiny set of vnodes that are active. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
# 234482	20-Apr-2012	mckusick	This change creates a new list of active vnodes associated with a mount point. Active vnodes are those with a non-zero use or hold count, e.g., those vnodes that are not on the free list. Note that this list is in addition to the list of all the vnodes associated with a mount point. To avoid adding another set of linkage pointers to the vnode structure, the active list uses the existing linkage pointers used by the free list (previously named v_freelist, now renamed v_actfreelist). This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
# 234443	18-Apr-2012	mckusick	Delete a no longer useful VNASSERT missed during changes in 234400. Suggested by: kib
# 234441	18-Apr-2012	mckusick	Fix a memory leak of M_VNODE_MARKER introduced in 234386. Found by: Peter Holm
# 234400	17-Apr-2012	mckusick	Drop export of vdestroy() function from kern/vfs_subr.c as it is used only as a helper function in that file. Replace sole call to vbusy() with inline code in vholdl(). Replace sole calls to vfree() and vdestroy() with inline code in vdropl(). The Clang compiler already inlines these functions, so they do not show up in a kernel backtrace which is confusing. Also you cannot set their frame in kgdb which means that it is impossible to view their local variables. So, while the produced code is unchanged, the debugging should be easier. Discussed with: kib MFC after: 2 weeks
# 234386	17-Apr-2012	mckusick	Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL. The primary changes are that the user of the interface no longer needs to manage the mount-mutex locking and that the vnode that is returned has its mutex locked (thus avoiding the need to check to see if it is DOOMED or other possible end of life scenarios). To minimize compatibility issues for third-party developers, the old MNT_VNODE_FOREACH interface will remain available so that this change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH will be removed in head. The reason for this update is to prepare for the addition of the MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
# 234158	11-Apr-2012	mckusick	Export vinactive() from kern/vfs_subr.c (e.g., make it no longer static and declare its prototype in sys/vnode.h) so that it can be called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c) instead of the body of vinactive() being cut and pasted into process_deferred_inactive(). Reviewed by: kib MFC after: 2 weeks
# 232709	09-Mar-2012	kib	Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which allows a filesystem to request VFS to not allow MNTK_ASYNC. MFC after: 1 week
# 232152	25-Feb-2012	trociny	When detaching a unix domain socket, uipc_detach() checks unp->unp_vnode pointer to detect if there is a vnode associated with (bound to) this socket and does necessary cleanup if there is. The issue is that after forced unmount this check may be too late as the unp_vnode is reclaimed and the reference is stale. To fix this provide a helper function that is called on a socket vnode reclamation to do necessary cleanup. Pointed by: kib Reviewed by: kib MFC after: 2 weeks
# 231075	06-Feb-2012	kib	Current implementations of sync(2) and syncer vnode fsync() VOP use mnt_noasync counter to temporarily remove MNTK_ASYNC mount option, which is needed to guarantee a synchronous completion of the initiated i/o before syscall or VOP return. Global removal of MNTK_ASYNC option is harmful because not only i/o started from corresponding thread becomes synchronous, but all i/o is synchronous on the filesystem which is initiated during sync(2) or syncer activity. Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local thread flag to disable async i/o for current thread only. Use the opportunity to move DOINGASYNC() macro into sys/vnode.h and consistently use it through places which tested for MNTK_ASYNC. Some testing demonstrated 60-70% improvements in run time for the metadata-intensive operations on async-mounted UFS volumes, but still with great deviation due to other reasons. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks
# 230553	25-Jan-2012	kib	When doing vflush(WRITECLOSE), clean vnode pages. Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is still a race allowing a process to dirty pages after msync finished. Remounts rw->ro just left dirty pages in system. Reviewed by: alc, tegge (long time ago) Tested by: pho MFC after: 2 weeks
# 230249	17-Jan-2012	mckusick	Make sure all intermediate variables holding mount flags (mnt_flag) and that all internal kernel calls passing mount flags are declared as uint64_t so that flags in the top 32-bits are not lost. MFC after: 2 weeks
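The failure mode that r230249 guards against can be shown with a small userland sketch. The flag values `EX_FLAG_HIGH` and `EX_FLAG_LOW` and the helper names are hypothetical and are not real `MNT_*` definitions; the 32-bit intermediate is modelled with `uint32_t` so that the truncation is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags for illustration only; not real MNT_* values. */
#define	EX_FLAG_HIGH	(UINT64_C(1) << 40)	/* lives in the top 32 bits */
#define	EX_FLAG_LOW	(UINT64_C(1) << 3)	/* lives in the low 32 bits */

/* Buggy pattern: a 32-bit intermediate silently drops bits 32..63. */
static uint64_t
pass_through_32(uint64_t flags)
{
	uint32_t tmp = (uint32_t)flags;

	return (tmp);
}

/* Fixed pattern: every intermediate is declared as uint64_t. */
static uint64_t
pass_through_64(uint64_t flags)
{
	uint64_t tmp = flags;

	return (tmp);
}

/* Show that only the 64-bit path preserves the high flag. */
static void
demo(void)
{
	uint64_t flags = EX_FLAG_HIGH | EX_FLAG_LOW;

	assert(pass_through_64(flags) == flags);
	assert(pass_through_32(flags) == EX_FLAG_LOW);
}
```

The sketch only models the data path; in the kernel the same concern applies to local variables and function parameters that carry `mnt_flag` values.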
# 229727	06-Jan-2012	jhb	Use proper argument structure types for the extattr post-VOP hooks. The wrong structure happened to work since the only argument used was the vnode which is in the same place in both VOP_SETATTR() and the two extattr VOPs. MFC after: 3 days
# 228849	23-Dec-2011	jhb	Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended attributes of a vnode are changed. Note that OS X already implements this behavior. Reviewed by: rwatson MFC after: 2 weeks
# 227070	04-Nov-2011	jhb	Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month
# 226849	27-Oct-2011	jhb	Whitespace fix.
# 226734	25-Oct-2011	pjd	The v_data field is a pointer, so set it to NULL, not 0. MFC after: 3 days
# 226098	07-Oct-2011	jonathan	Change one printf() to log(). As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console. Approved by: mentor (rwatson) Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru> MFC after: 5 days
# 226022	04-Oct-2011	kib	Move parts of the commit log for r166167, where Tor explained the interaction between vnode locks and vfs_busy(), into comment. MFC after: 1 week
# 225177	25-Aug-2011	attilio	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before destroying the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before destroying the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which install selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a fully correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks
# 224294	24-Jul-2011	mckusick	Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag so that it is visible to userland programs. This change enables the `mount' command with no arguments to be able to show if a filesystem is mounted using journaled soft updates as opposed to just normal soft updates. Approved by: re (bz)
# 223677	29-Jun-2011	alc	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib
# 223505	24-Jun-2011	jonathan	Tidy up a capabilities-related comment. This comment refers to an #ifdef that hasn't been merged [yet?]; remove it. Approved by: rwatson
# 221829	13-May-2011	mdf	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
# 221173	28-Apr-2011	attilio	Add the watchdogs patting during the (shutdown time) disk syncing and disk dumping. With the option SW_WATCHDOG on, these operations are doomed to let watchdog fire, if they take too long. I implemented the stubs this way because I really want wdog_kern_* KPI to not be dependent on SW_WATCHDOG being on (and really, the option only enables watchdog activation in hardclock) and also avoid to call them when not necessary (avoiding involuntary watchdog activations). Sponsored by: Sandvine Incorporated Discussed with: emaste, des MFC after: 2 weeks
# 220967	23-Apr-2011	rmacklem	Fix a LOR in vfs_busy() where, after msleeping, it would lock the mutexes in the wrong order for the case where the MBF_MNTLSTLOCK is set. I believe this did have the potential for deadlock. For example, if multiple nfsd threads called vfs_busyfs(), which calls vfs_busy() with MBF_MNTLSTLOCK. Thanks go to pho for catching this during his testing. Tested by: pho Submitted by: kib MFC after: 2 weeks
# 220328	04-Apr-2011	pluknet	Remove malloc type M_NETADDR unused since splitting into vfs_subr.c and vfs_export.c. MFC after: 1 week
# 219396	08-Mar-2011	kib	Do not assert buffer lock in VFS_STRATEGY() when kernel already panicked. Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 218424	08-Feb-2011	mdf	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yield() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
# 218195	02-Feb-2011	mdf	Put the general logic for being a CPU hog into a new function should_yield(). Use this in various places. Encapsulate the common case of check-and-yield into a new function maybe_yield(). Change several checks for a magic number of iterations to use should_yield() instead. MFC after: 1 week
# 217824	25-Jan-2011	kib	When vtruncbuf() iterates over the vnode buffer list, lock buffer object before checking the validity of the next buffer pointer. Otherwise, the buffer might be reclaimed after the check, causing iteration to run into wrong buffer. Reported and tested by: pho MFC after: 1 week
# 217555	18-Jan-2011	mdf	Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need to rely on the format string.
# 217326	12-Jan-2011	mdf	sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly. Commit the kernel changes.
# 217076	06-Jan-2011	jhb	- Restore dropping the priority of syncer down to PPAUSE when it is idle. This was lost when it was converted to using a condition variable instead of lbolt. - Drop the priority of flowtable down to PPAUSE when it is idle as well since it is a similar background task. MFC after: 2 weeks
# 216733	27-Dec-2010	kib	Teach ddb "show mount" about MNTK_SUJ flag.
# 215797	24-Nov-2010	kib	Allow shared-locked vnode to be passed to vunref(9). When shared-locked vnode is supplied as an argument to vunref(9) and resulting usecount is 0, set VI_OWEINACT and do not try to upgrade vnode lock. The latter could cause vnode unlock, allowing the vnode to be reclaimed meantime. Tested by: pho MFC after: 1 week
# 215548	19-Nov-2010	kib	Remove prtactive variable and related printf()s in the vop_inactive and vop_reclaim() methods. They seem to be unused, and the reported situation is normal for the forced unmount. MFC after: 1 week X-MFC-note: keep prtactive symbol in vfs_subr.c
# 215304	14-Nov-2010	brucec	Fix some more style(9) issues.
# 215283	14-Nov-2010	brucec	Fix style(9) issues from r215281 and r215282. MFC after: 1 week
# 215282	14-Nov-2010	brucec	Add descriptions to some more sysctls. PR: kern/148510 MFC after: 1 week
# 212466	11-Sep-2010	kib	Protect mnt_syncer with the sync_mtx. This prevents a (rare) vnode leak when mount and update are executed in parallel. Encapsulate syncer vnode deallocation into the helper function vfs_deallocate_syncvnode(), to not externalize sync_mtx from vfs_subr.c. Found and reviewed by: jh (previous version of the patch) Tested by: pho MFC after: 3 weeks
# 212096	01-Sep-2010	emaste	As long as we are going to panic anyway, there's no need to hide additional information behind DIAGNOSTIC.
# 212002	30-Aug-2010	jh	execve(2) has a special check for file permissions: a file must have at least one execute bit set, otherwise execve(2) will return EACCES even for a user with PRIV_VFS_EXEC privilege. Add the check also to vaccess(9), vaccess_acl_nfs4(9) and vaccess_acl_posix1e(9). This makes access(2) agree better with execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check to zfs_freebsd_access() too. There may be other file systems which are not using vaccess*() functions and need to be handled separately. PR: kern/125009 Reviewed by: bde, trasz Approved by: pjd (ZFS part)
# 211930	28-Aug-2010	pjd	There is a bug in vfs_allocate_syncvnode() failure handling in mount code. Actually it is hard to properly handle such a failure, especially in MNT_UPDATE case. The only reason for the vfs_allocate_syncvnode() function to fail is getnewvnode() failure. Fortunately it is impossible for current implementation of getnewvnode() to fail, so we can assert this and make vfs_allocate_syncvnode() void. This in turn frees us from handling its failures in the mount code. Reviewed by: kib MFC after: 1 month
# 211213	12-Aug-2010	kib	The buffers b_vflags field is not always properly protected by bufobj lock. If b_bufobj is not NULL, then bufobj lock should be held when manipulating the flags. Not doing this sometimes leaves BV_BKGRDINPROG to be erroneously set, causing softdep's getdirtybuf() to get stuck indefinitely in "getbuf" sleep, waiting for background write to finish which is not actually performed. Add BO_LOCK() in the cases where it was missed. In collaboration with: pho Tested by: bz Reviewed by: jeff MFC after: 1 month
# 210837	04-Aug-2010	alc	In order for MAXVNODES_MAX to be an "int" on powerpc and sparc, we must cast PAGE_SIZE to an "int". (Powerpc and sparc, unlike the other architectures, define PAGE_SIZE as a "long".) Submitted by: Andreas Tobler
# 210782	02-Aug-2010	alc	Update the "desiredvnodes" calculation. In particular, make the part of the calculation that is based on the kernel's heap size more conservative. Hopefully, this will eliminate the need for MAXVNODES_MAX, but for the time being set MAXVNODES_MAX to a large value. Reviewed by: jhb@ MFC after: 6 weeks
# 209390	21-Jun-2010	ed	Use ISO C99 integer types in sys/kern where possible. There are only about 100 occurrences of the BSD-specific u_int*_t datatypes in sys/kern. The ISO C99 integer types are used here more often.
# 209260	17-Jun-2010	pjd	Backout r207970 for now, it can lead to deadlocks. Reported by: kan MFC after: 3 days
# 208773	03-Jun-2010	kib	Sometimes vnodes share the lock despite being different vnodes on different mount points, e.g. the nullfs vnode and the covered vnode from the lower filesystem. In this case, existing assertion in vop_rename_pre() may be triggered. Check for vnode locks equality instead of the vnodes themselves to not trip over the situation. Submitted by: Mikolaj Golub <to.my.trociny@gmail.com> Tested by: pho MFC after: 2 weeks
# 208003	12-May-2010	zml	Add VOP_ADVLOCKPURGE so that the file system is called when purging locks (in the case where the VFS impl isn't using lf_*) Submitted by: Matthew Fleming <matthew.fleming@isilon.com> Reviewed by: zml, dfr
# 207970	12-May-2010	pjd	When there is no memory or KVA, try to help by reclaiming some vnodes. This helps with 'kmem_map too small' panics. No objections from: kib Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua> MFC after: 1 week
# 207937	11-May-2010	pjd	I added vfs_lowvnodes event, but it was only used for a short while and now it is totally unused. Remove it. MFC after: 3 days
# 207141	24-Apr-2010	jeff	- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown. Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm
# 206160	04-Apr-2010	jh	Add missing MNT_NFS4ACLS.
# 206135	03-Apr-2010	pjd	Fix some whitespace nits.
# 206134	03-Apr-2010	pjd	Add missing mnt_kern_flag flags in 'show mount' output.
# 206093	02-Apr-2010	kib	Add function vop_rename_fail(9) that performs needed cleanup for locks and references of the VOP_RENAME(9) arguments. Use vop_rename_fail() in deadfs_rename(). Tested by: Mikolaj Golub MFC after: 1 week
# 202528	17-Jan-2010	kib	Add new function vunref(9) that decrements vnode use count (and hold count) while vnode is exclusively locked. The code for vput(9), vrele(9) and vunref(9) is merged. In collaboration with: pho Reviewed by: alc MFC after: 3 weeks
# 201134	28-Dec-2009	kib	Add a knob to allow reclaim of the directory vnodes that are source of the namecache records. The reclamation is not enabled by default because for typical workload it would make namecache unusable, but large nested directory tree easily puts any process that accesses filesystem into 1 second wait for vlru. Reported by: yar (long time ago) MFC after: 3 days
# 201019	26-Dec-2009	trasz	Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND is not being used improperly.
# 200770	21-Dec-2009	kib	VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object flag. Besides providing the redundant information, need to update both vnode and object flags causes more acquisition of vnode interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects. Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects. Suggested and reviewed by: alc Tested by: pho MFC after: 3 weeks
# 199529	19-Nov-2009	jh	Extend ddb(4) "show mount" command to print active string mount options. Note that only option names are printed, not values. Reviewed by: pjd Approved by: trasz (mentor) MFC after: 2 weeks
# 197680	01-Oct-2009	trasz	Provide default implementation for VOP_ACCESS(9), so that filesystems which want to provide VOP_ACCESSX(9) don't have to implement both. Note that this commit makes implementation of either of these two mandatory. Reviewed by: kib
# 197134	12-Sep-2009	rwatson	Use C99 initialization for struct filterops. Obtained from: Mac OS X Sponsored by: Apple Inc. MFC after: 3 weeks
# 197030	09-Sep-2009	kib	In vfs_mark_atime(9), be resistant against reclaimed vnodes. Assert that necessary locks are taken, since vop might not be called. Tested by: pho MFC after: 3 days
# 195285	02-Jul-2009	jamie	Call prison_check from vfs_suser rather than re-implementing it. Approved by: re (kib), bz (mentor)
# 193951	10-Jun-2009	kib	Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland
# 193511	05-Jun-2009	rwatson	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# 193138	31-May-2009	attilio	Remove the now invalid (and possibly unused) debug.mpsafevfs sysctl/tunable. Reviewed by: emaste Sponsored by: Sandvine Incorporated
# 193092	30-May-2009	trasz	Add VOP_ACCESSX, which can be used to query for newly added V* permissions, such as VWRITE_ACL. For filesystems that don't implement it, there is a default implementation, which works as a wrapper around VOP_ACCESS. Reviewed by: rwatson@
# 192895	27-May-2009	jamie	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
# 191990	11-May-2009	attilio	Remove the thread argument from the FSD (File-System Dependent) parts of the VFS. Now all the VFS_* functions and related parts don't want the context as long as it always refers to curthread. In some points, in particular when dealing with VOPs and functions living in the same namespace (e.g. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP. While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option. VFS KPI is heavily changed by this commit so third-party modules need to be recompiled. Bump __FreeBSD_version in order to signal such situation.
# 190533	29-Mar-2009	kan	Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache in directory vnodes. Allow namecache dotdot entry to be created pointing from child vnode to parent vnode if no existing links in opposite direction exist. Use direct link from parent to child for dotdot lookups otherwise. This restores more efficient dotdot caching in NFS filesystems which was lost when vnodes stopped being type stable. Reviewed by: kib
# 189287	02-Mar-2009	kan	Change vfs_busy to wait until an outcome of pending unmount operation is known and to retry or fail accordingly to that outcome. This fixes the problem with namespace traversing programs failing with random ENOENT errors if someone just happened to try to unmount that same filesystem at the same time. Reported by: dhw Reviewed by: kib, attilio Sponsored by: Juniper Networks, Inc.
# 188244	06-Feb-2009	jhb	Tweak the output of VOP_PRINT/vn_printf() some. - Align the fifo output in fifo_print() with other vn_printf() output. - Remove the leading space from lockmgr_printinfo() so its output lines up in vn_printf(). - lockmgr_printinfo() now ends with a newline, so remove an extra newline from vn_printf().
# 188243	06-Feb-2009	trasz	Add KASSERTs to make it easier to debug problems like the one fixed in r188141. Reviewed by: kib,attilio Approved by: rwatson (mentor) Tested by: pho Sponsored by: FreeBSD Foundation
# 188150	05-Feb-2009	attilio	Add more KTR_VFS logging point in order to have a more effective tracing. Reviewed by: brueffer, kib Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>
# 187654	23-Jan-2009	jhb	Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not updating va_atime via VOP_SETATTR() isn't always synchronous. For some filesystems it is asynchronous. Suggested by: bde
# 187653	23-Jan-2009	jhb	Push down Giant in the vnlru kproc main loop so that it is only acquired around calls to vlrureclaim() on non-MPSAFE filesystems. Specifically, vnlru no longer needs Giant for the common case of waking up and deciding there is nothing for it to do. MFC after: 2 weeks
# 187564	21-Jan-2009	jhb	Fix a few style bogons. Submitted by: bde
# 187526	21-Jan-2009	jhb	Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP: VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME can be performed while holding a shared vnode lock (the same functionality is done internally by VOP_READ which can run with a shared vnode lock). Add missing locking of the vnode interlock to the ufs implementation and remove a special note and test from the NFS client about not supporting the feature. Inspired by: ups Tested by: pho
# 187467	20-Jan-2009	kib	FFS puts the extended attributes blocks at the negative blocks for the vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it incorrectly does vm_object_page_remove(0, 0), removing all pages from the underlying vm object, not only the pages that back the extended attributes data. Change vinvalbuf() to not remove any pages from the object when V_NORMAL or V_ALT are specified. Instead, the only in-tree caller in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitly removes the corresponding page range. The V_NORMAL caller does vnode_pager_setsize(vp, 0) immediately after the call to vinvalbuf(V_NORMAL) already. Reported by: csjp Reviewed by: ups MFC after: 3 weeks
# 186197	16-Dec-2008	attilio	1) Fix a deadlock in the VFS: - threadA runs vfs_rel(mp1) - threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drops MNT_ILOCK() - threadA runs vfs_busy(mp1) and, as long as MNTK_UNMOUNT is set, sleeps waiting for threadB to complete the unmount - threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting for the refcount to expire. Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the unmounter is waiting for mnt_ref to expire. The vfs_busy contenders get woken up, fail, and if they retry the MNTK_REFEXPIRE won't allow them to sleep again. 2) Simplify significantly the code of vfs_mount_destroy() trimming unnecessary code: - as long as any reference exited, it is no longer possible to have write-op (primary and secondary) in progress. - it is not needed to drop and reacquire the mount lock. - filling the structures with dummy values is useless as long as it is going to be freed. Tested by: pho, Andrea Barberio <insomniac at slackware dot it> Discussed with: kib
# 185432	29-Nov-2008	kib	In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer to the fs, but before a vnode on the fs is locked, unmount may free fs structures, causing access to destroyed data and freed memory. Introduce a vfs_busymp() function that looks up and busies found fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the implementation of the handle syscalls. Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular, sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl handler, that prevents Giant-locked unmount code to interfere with it. Noted by: tegge Reviewed by: dfr Tested by: pho MFC after: 1 month
# 185029	17-Nov-2008	pjd	Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes. This brings a huge amount of changes, I'll enumerate only user-visible changes: - Delegated Administration Allows regular users to perform ZFS operations, like file system creation, snapshot creation, etc. - L2ARC Level 2 cache for ZFS - allows to use additional disks for cache. Huge performance improvements mostly for random read of mostly static content. - slog Allow to use additional disks for ZFS Intent Log to speed up operations like fsync(2). - vfs.zfs.super_owner Allows regular users to perform privileged operations on files stored on ZFS file systems owned by them. Very careful with this one. - chflags(2) Not all the flags are supported. This still needs work. - ZFSBoot Support to boot off of ZFS pool. Not finished, AFAIK. Submitted by: dfr - Snapshot properties - New failure modes Before, if a write request failed, the system panicked. Now one can select from one of three failure modes: - panic - panic on write error - wait - wait for disk to reappear - continue - serve read requests if possible, block write requests - Refquota, refreservation properties Just quota and reservation properties, but don't count space consumed by children file systems, clones and snapshots. - Sparse volumes ZVOLs that don't reserve space in the pool. - External attributes Compatible with extattr(2). - NFSv4-ACLs Not sure about the status, might not be complete yet. Submitted by: trasz - Creation-time properties - Regression tests for zpool(8) command. Obtained from: OpenSolaris
# 184599	03-Nov-2008	attilio	Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless. Really, the concept of holdcnt in the struct mount is represented by the mnt_ref (which prevents the type-stable structure from being "recycled") handled through vfs_ref() and vfs_rel(). In this light, switch the holdcnt acquisition into an emulated vfs_ref() (and subsequent release into vfs_rel()). Discussed with: kib Tested by: pho
# 184554	02-Nov-2008	attilio	Improve VFS locking: - Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters. - Due to the change above, remove the mnt_lock lockmgr because it is now useless. - Due to the change above, vfs_busy() is no longer linked to a lockmgr. Change its KPI by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the old version (which was unlinked from the lockmgr already) and MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx once the mnt interlock is held (ability still desired by most consumers). - The stub used in vfs_mount_destroy(), which allows overriding the mnt_ref if running for more than 3 seconds, makes it totally useless. Remove it as it was thought to work in older versions. If a problem of "refcount held never going away" should appear, we will need to fix it properly rather than trust such a hackish solution. - Fix a bug where returning (with an error) from dounmount() was still leaving the MNTK_MWAIT flag on even if the waiters were actually woken up. Just a place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter. - Remove the markercnt refcount as it is useless. This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly. Discussed with: kib Tested by: pho
# 184413	28-Oct-2008	trasz	Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)
# 184411	28-Oct-2008	kib	Style return statements in vn_pollrecord().
# 184409	28-Oct-2008	kib	Protect check for v_pollinfo == NULL and assignment of the newly allocated vpollinfo with vnode interlock. Fully initialize vpollinfo before putting pointer to it into vp->v_pollinfo. Discussed with: dwhite Tested by: pho MFC after: 1 week
# 184073	20-Oct-2008	kib	In vfs_busy(), lockmgr() cannot legitimately sleep, because code checked MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always tries to obtain a shared lock on mnt_lock, the other user is unmount who tries to drain it, setting MNTK_UNMOUNT before. Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks
# 183754	10-Oct-2008	attilio	Remove the useless struct thread argument from the bufobj interface. In particular, the KPI of the following functions is modified: - bufobj_invalbuf() - bufsync() and BO_SYNC() "virtual method" of the buffer objects set. Main consumers of bufobj functions are affected by this change too and, in particular, functions which changed their KPI are: - vinvalbuf() - g_vfs_close() Due to the KPI breakage, __FreeBSD_version will be bumped in a later commit. As a side note, please consider the 'curthread' argument passing to VOP_SYNC() (in bufsync()) as just temporary, as it will be axed out ASAP Reviewed by: kib Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
# 182542	31-Aug-2008	attilio	Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions. Manpages are updated accordingly. Tested by: Diego Sardina <siarodx at gmail dot com>
# 182371	28-Aug-2008	attilio	Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread was always curthread and totally useless. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
# 182364	28-Aug-2008	kib	Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmntque() function to ignore the unmounting and forces insertion of the vnode into the mount vnode list. Change insmntque() to fail when forced unmount is in progress and VV_FORCEINSMQ is not specified. Add an assertion to the insmntque(), requiring the vnode to be exclusively locked for mp-safe filesystems. Use the VV_FORCEINSMQ for the creation of the syncvnode. Tested by: pho Reviewed by: tegge MFC after: 1 month
# 182120	24-Aug-2008	csjp	Remove worrying printf warning on bootup when processing vnodes which have NULL mount-points. This is the case for special vnodes, such as the one used in nameiinit() which is used for crossing mount points in lookup() to avoid lock ordering issues. MFC after: 2 weeks Discussed with: rwatson, kib
# 180995	30-Jul-2008	ed	Remove the use of lbolt from the VFS syncer. It seems we only use `lbolt' inside the VFS syncer and the TTY layer now. Because I'm planning to replace the TTY layer next month, there's no reason to keep `lbolt' if it's only used in a single thread inside the kernel. Because the syncer code wanted to wake up the syncer thread before the timeout, it called sleepq_remove(). Because we now just use a condvar(9) with a timeout value of `hz', we can wake it up using cv_broadcast() without waking up any unrelated threads. Reviewed by: phk
# 180844	27-Jul-2008	pjd	Assert for exclusive vnode lock in vinactive(), vrecycle() and vgonel() functions. Reviewed by: kib
# 180843	27-Jul-2008	pjd	- Move vp test for being NULL under IGNORE_LOCK(). - Check if panicstr isn't set, if it is ignore the lock. This helps to avoid confusion, because lockmgr is a no-op when panicstr isn't NULL, so asserting anything at this point doesn't make sense and can just race with other panic. Discussed with: kib
# 180682	21-Jul-2008	attilio	- Disallow XFS mounting in write mode. The write support never worked really and there is no need to maintain it. - Fix vn_get() in order to let it call vget(9) with a valid locking request. vget(9) returns the vnode locked in order to prevent recycling, but in this case internal XFS locks already prevent it from happening, so it is safe to drop the vnode lock before returning from vn_get(). - Add a VNASSERT() in vget(9) in order to catch malformed locking requests. Discussed with: kan, kib Tested by: Lothar Braun <lothar at lobraun dot de>
# 179093	18-May-2008	pjd	Be more friendly for DDB pager. Educated by: jhb's BSDCan presentation
# 178761	04-May-2008	attilio	sync_vnode() has some messy code about locking in order to deal with mount fs needing Giant to be held when processing bufobjs. Use a different subqueue for pending workitems on filesystems requiring Giant. This simplifies the code notably and also reduces the number of Giant acquisitions (and the whole processing cost). Suggested by: jeff Reviewed by: kib Tested by: pho
# 178585	26-Apr-2008	pjd	Implement 'show mount' command in DDB. Without argument, it prints short info about all currently mounted file systems. When an address is given as an argument, prints detailed info about the given mount point. MFC after: 2 weeks
# 178458	24-Apr-2008	kib	Allow the vnode zone to return the unused memory. The vnode reference count is/shall be properly maintained for the long time, and VFS shall be safe against the vnode memory reclamation. Proposed by: jeff Tested by: pho
# 178243	16-Apr-2008	kib	Move the head of byte-level advisory lock list from the filesystem-specific vnode data to the struct vnode. Provide the default implementation for the vop_advlock and vop_advlockasync. Purge the locks on the vnode reclaim by using the lf_purgelocks(). The default implementation is augmented for the nfs and smbfs. In the nfs_advlock, push the Giant inside the nfs_dolock. Before the change, the vop_advlock and vop_advlockasync have taken the unlocked vnode and dereferenced the fs-private inode data, racing with the vnode reclamation due to forced unmount. Now, the vop_getattr under the shared vnode lock is used to obtain the inode size, and later, in the lf_advlockasync, after locking the vnode interlock, the VI_DOOMED flag is checked to prevent an operation on the doomed vnode. The implementation of the lf_purgelocks() is submitted by dfr. Reported by: kris Tested by: kris, pho Discussed with: jeff, dfr MFC after: 2 weeks
# 177857	02-Apr-2008	jeff	- Destroy the bo mtx when the vnode is destroyed.
# 177687	28-Mar-2008	attilio	b_waiters cannot be adequately protected by the interlock because it is dropped after the call to lockmgr() so just revert this approach using something similar to the preceding one: BUF_LOCKWAITERS() just checks if there are waiters (not the actual number of them) and it is based on the newly introduced lockmgr_waiters() which returns whether the lockmgr has waiters or not. The name has been chosen to differ from the old lockwaiters() in order to not confuse them. The KPI is enriched by this commit so __FreeBSD_version bumping and manpage update will be happening soon. 'struct buf' also changes, so kernel ABI is disturbed. Bug found by: jeff Approved by: jeff, kib
# 177539	24-Mar-2008	jeff	- Greatly simplify vget() by removing the guarantee that any new references to a vnode with VI_OWEINACT set will force the vinactive() call. The kernel makes no guarantees about which reference was the last to close a file or when the actual inactive processing will happen. The previous code was designed to preserve existing semantics in the face of shared locks, however, this was unnecessary. Discussed with: mckusick
# 177513	23-Mar-2008	jeff	- Only return 1 from sync_vnode() in cases where the vnode is still at the head of the sync list. This prevents sched_sync() from re-queueing a vnode which may have been freed already. Discussed with: kib
# 177511	23-Mar-2008	jeff	- Pass BO_MTX(bo) to lockmgr in vtruncbuf, we don't own the vnode interlock here anymore. Reported by: kris
# 177493	22-Mar-2008	jeff	- Complete part of the unfinished bufobj work by consistently using BO_LOCK/UNLOCK/MTX when manipulating the bufobj. - Create a new lock in the bufobj to lock bufobj fields independently. This leaves the vnode interlock as an 'identity' lock while the bufobj is an io lock. The bufobj lock is ordered before the vnode interlock and also before the mnt ilock. - Exploit this new lock order to simplify softdep_check_suspend(). - A few sync related functions are marked with a new XXX to note that we may not properly interlock against a non-zero bv_cnt when attempting to sync all vnodes on a mountlist. I do not believe this race is important. If I'm wrong this will make these locations easier to find. Reviewed by: kib (earlier diff) Tested by: kris, pho (earlier diff)
# 177253	16-Mar-2008	rwatson	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink
# 176708	01-Mar-2008	attilio	- Handle buffer lock waiters count directly in the buffer cache instead of relying on the lockmgr support [1]: * bump the waiters only if the interlock is held * let brelvp() return the waiters count * rely on brelvp() instead of BUF_LOCKWAITERS() in order to check for the waiters number - Remove a namespace pollution introduced recently with lockmgr.h including lock.h by including lock.h directly in the consumers and making it mandatory for using lockmgr. - Modify flags accepted by lockinit(): * introduce LK_NOPROFILE which disables lock profiling for the specified lockmgr * introduce LK_QUIET which disables ktr tracing for the specified lockmgr [2] * disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it can only be used on a per-instance basis - Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer used This patch breaks KPI so __FreeBSD_version will be bumped and manpages updated by further commits. Additionally, the 'struct buf' changes result in a disturbed ABI as well. [2] Really, currently there is no ktr tracing in the lockmgr, but it will be added soon. [1] Submitted by: kib Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
# 176559	25-Feb-2008	attilio	Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is always curthread. As KPI gets broken by this patch, manpages and __FreeBSD_version will be updated by further commits. Tested by: Andrea Barberio <insomniac at slackware dot it>
# 176116	08-Feb-2008	attilio	Convert all explicit instances of VOP_ISLOCKED(arg, NULL) into VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should only take curthread as argument; this will lead to axing the additional argument from both functions, making the code cleaner. Reviewed by: jeff, kib
# 175635	24-Jan-2008	attilio	Cleanup lockmgr interface and exported KPI: - Remove the "thread" argument from the lockmgr() function as it is always curthread now - Axe lockcount() function as it is no longer used - Axe LOCKMGR_ASSERT() as it is bogus really and not currently used. Hopefully this will soon be replaced by something suitable for it. - Remove the prototype for dumplockinfo() as the function is no longer present Additionally: - Introduce a KASSERT() in lockstatus() in order to let it accept only curthread or NULL as they should only be passed - Do a little bit of style(9) cleanup on lockmgr.h The KPI is heavily broken by this change, so manpages and FreeBSD_version will be modified accordingly by further commits. Tested by: matteo
# 175486	19-Jan-2008	attilio	- Introduce the function lockmgr_recursed() which returns true if the lockmgr lkp, when held in exclusive mode, is recursed - Introduce the function BUF_RECURSED() which does the same for bufobj locks based on the top of lockmgr_recursed() - Introduce the function BUF_ISLOCKED() which works like the counterpart VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus BUF_REFCNT() in a more explicative and SMP-compliant way. This allows us to axe out BUF_REFCNT() and leaving the function lockcount() totally unused in our stock kernel. Further commits will axe lockcount() as well as part of lockmgr() cleanup. KPI results, obviously, broken so further commits will update manpages and freebsd version. Tested by: kris (on UFS and NFS)
# 175294	13-Jan-2008	attilio	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the useless extra argument and pass curthread explicitly to lower layer functions, when necessary. The KPI is broken by this change, which should affect several ports, so version bumping and manpage update will be committed later. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
# 175202	10-Jan-2008	attilio	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modification makes the code cleaner and in particular removes an annoying dependency, helping the next lockmgr() cleanup. The KPI is, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, it is worth saying that the next commits will address a similar cleanup of VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
# 174952	28-Dec-2007	rwatson	In "show lockedvnods" DDB command, use db_printf() rather than printf() so that the results end up in the DDB output stream rather than the console output stream. This should likely also be done for the vprint() function it calls. MFC after: 3 months
# 174943	27-Dec-2007	attilio	As LK_EXCLUPGRADE is used in conjunction with LK_NOWAIT, LK_UPGRADE becomes equivalent to it, so make the switch. That call is the only remaining LK_EXCLUPGRADE consumer and removing it will prepare the ground for LK_EXCLUPGRADE axing and further lockmgr improvements. Discussed with: jeff, ups
# 174898	25-Dec-2007	rwatson	Add a new 'why' argument to kdb_enter(), and a set of constants to use for that argument. This will allow DDB to detect the broad category of reason why the debugger has been entered, which it can use for the purposes of deciding which DDB script to run. Assign approximate why values to all current consumers of the kdb_enter() interface.
# 174284	05-Dec-2007	kib	Use curthread instead of the FIRST_THREAD_IN_PROC for vnlru and syncer, when applicable. Acquire Giant slightly later for vnlru. In the syncer, acquire the Giant only when a vnode belongs to the non-MPsafe fs. In both speedup_syncer() and syncer_shutdown(), remove the syncer thread from the lbolt sleep queue after the syncer state is modified, not before. Herded by: attilio Tested by: Peter Holm Reviewed by: ups MFC after: 1 week
# 172930	24-Oct-2007	rwatson	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 172836	20-Oct-2007	julian	Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx as they actually make whole processes. This makes way for us to add REAL kthread_create() and friends that actually make threads. It turns out that most of these calls actually end up being moved back to the thread version when it's added, but we need to make this cosmetic change first. I'd LOVE to do this rename in 7.0 so that we can eventually MFC the new kthread_xxx() calls.
# 172151	12-Sep-2007	kib	When restoring the mount after umount failed, the MNTK_UNMOUNT flag prevents insmntque() from placing reallocated syncer vnode on mount list, that causes panic in vfs_allocate_syncvnode(). Introduce MNTK_NOINSMNTQ flag, that marks the period when insmntque() is not allowed to succeed, instead of MNTK_UNMOUNT. The MNTK_NOINSMNTQ is set and cleared simultaneously with MNTK_UNMOUNT, except on umount error path, where it is cleared just before the syncer vnode is going to be allocated. Reported by: Peter Jeremy <peterjeremy optushome com au> Suggested by: tegge Approved by: re (rwatson)
# 171823	13-Aug-2007	pjd	Improve vn_printf() by: - adding missing vnode flags, - printing unknown flags as numbers, - using strlcat() instead of strcat(). Approved by: re (bmah)
# 170587	12-Jun-2007	rwatson	Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in some cases, move to priv_check() if it was an operation on a thread and no other flags were present. Eliminate caller-side jail exception checking (also now-unused); jail privilege exception code now goes solely in kern_jail.c. We can't yet eliminate suser() due to some cases in the KAME code where a privilege check is performed and then used in many different deferred paths. Do, however, move those prototypes to priv.h. Reviewed by: csjp Obtained from: TrustedBSD Project
# 170170	31-May-2007	attilio	Revert VMCNT_* operations introduction. Probably, a general approach is not the best solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)
# 170035	27-May-2007	rwatson	Universally adopt most conventional spelling of acquire.
# 169671	18-May-2007	kib	Since renaming of vop_lock to _vop_lock, pre- and post-condition function calls are no more generated for vop_lock. Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption about vop naming conventions. This restores pre/post-condition calls.
# 169667	18-May-2007	jeff	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>
# 168699	13-Apr-2007	pjd	Fix jails and jail-friendly file systems handling: - We need to allow for PRIV_VFS_MOUNT_OWNER inside a jail. - Move security checks to vfs_suser() and deny unmounting and updating for jailed root from different jails, etc. OK'ed by: rwatson
# 168682	13-Apr-2007	pjd	When we are running low on vnodes, there is currently no way to ask other subsystems to release some vnodes. Implement backpressure based on vfs_lowvnodes event (similar to vm_lowmem for memory).
# 168587	10-Apr-2007	pjd	Minor style cleanups (mostly removal of trailing whitespaces).
# 168585	10-Apr-2007	pjd	Correct typos.
# 168204	01-Apr-2007	pjd	Now that the vdropl() function is public, assert that the vnode interlock is held.
# 168192	31-Mar-2007	des	Make vdropl() public; zfs needs it. There is also plenty of existing file system code (mostly *_reclaim()) which look like this: VOP_LOCK(vp); /* examine vp */ VOP_UNLOCK(vp); vdrop(vp); This can now be rewritten to: VOP_LOCK(vp); /* examine vp */ vdropl(vp); /* will unlock vp */ MFC after: 1 week
# 167933	27-Mar-2007	marcel	PowerPC is the only architecture with mpsafe_vfs=0. This is now broken. Rudimentary tests show that PowerPC can run with mpsafe_vfs=1. Make it so...
# 167497	13-Mar-2007	tegge	Make insmntque() externally visible and allow it to fail (e.g. during late stages of unmount). On failure, the vnode is recycled. Add insmntque1(), to allow for file system specific cleanup when recycling vnode on failure. Change getnewvnode() to no longer call insmntque(). Previously, embryonic vnodes were put onto the list of vnode belonging to a file system, which is unsafe for a file system marked MPSAFE. Change vfs_hash_insert() to no longer lock the vnode. The caller now has that responsibility. Change most file systems to lock the vnode and call insmntque() or insmntque1() after a new vnode has been sufficiently setup. Handle failed insmntque*() calls by propagating errors to callers, possibly after some file system specific cleanup. Approved by: re (kensmith) Reviewed by: kib In collaboration with: kib
# 164248	13-Nov-2006	kmacy	Change vop_lock handling to allow tracking of callers' file and line for acquisition of lockmgr locks Approved by: scottl (standing in for mentor rwatson)
# 164073	07-Nov-2006	jhb	Simplify operations with sync_mtx in sched_sync(): - Don't drop the lock just to reacquire it again to check rushjob, this only wastes time. - Use msleep() to drop the mutex while sleeping instead of explicitly unlocking around tsleep. Reviewed by: pjd
# 164070	07-Nov-2006	jhb	Fix comment typo and function declaration.
# 164033	06-Nov-2006	rwatson	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
# 163988	04-Nov-2006	pjd	Typo, 'from' vnode is locked here, not 'to' vnode.
# 163841	31-Oct-2006	pjd	Add gjournal specific code to the UFS file system: - Add FS_GJOURNAL flag which enables gjournal support on a file system. - Add cg_unrefs field to the cylinder group structure which holds number of unreferenced (orphaned) inodes in the given cylinder group. - Add fs_unrefs field to the super block structure which holds total number of unreferenced (orphaned) inodes. - When file or a directory is orphaned (last reference is removed, but object is still open), increase fs_unrefs and cg_unrefs fields, which is a hint for fsck in which cylinder groups looks for such (orphaned) objects. - When file is last closed, decrease {fs,cg}_unrefs fields. - Add VV_DELETED vnode flag which points at orphaned objects. Sponsored by: home.pl
# 163606	22-Oct-2006	rwatson	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 162945	02-Oct-2006	kib	Correct the comment: numvnodes is decreased on vdestroying the vnode. OKed by: tegge Approved by: pjd (mentor) MFC after: 1 week
# 162649	26-Sep-2006	tegge	Add mnt_noasync counter to better handle interleaved calls to nmount(), sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag which is set only when MNT_ASYNC is set and mnt_noasync is zero, and check that flag instead of MNT_ASYNC before initiating async io.
# 162647	26-Sep-2006	tegge	Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag. This eliminates a race where MNT_UPDATE flag could be lost when nmount() raced against sync(), sync_fsync() or quotactl().
# 162024	04-Sep-2006	pjd	Add 'show vnode <addr>' DDB command.
# 161160	10-Aug-2006	pjd	getnewvnode() can be called with NULL mp. Found by: Coverity Prevent (tm) Coverity ID: 1521 Confirmed by: phk
# 161122	09-Aug-2006	pjd	Add a bandaid to avoid a deadlock when we are trying to suspend a file system but need to obtain a vnode. We may not be able to do it, because all vnodes could be already in use and other processes cannot release them, because they are waiting in "suspfs" state. In such a situation, we allow a vnode to be allocated anyway. This is a temporary fix - there is no backpressure to free vnodes allocated in those circumstances. MFC after: 1 week Reviewed by: tegge
# 161020	06-Aug-2006	rwatson	Improve commenting of vaccess(), making sure to be clear that the ifdef capabilities code is there for reference and never actually used. Slight style tweak.
# 160378	15-Jul-2006	alc	Enable debug.mpsafevfs by default on arm. Since every architecture except powerpc has debug.mpsafevfs enabled by default, it is shorter to enumerate the architectures on which debug.mpsafevfs is off. Tested by: cognet@
# 160112	05-Jul-2006	kib	Back out my rev. 1.674. The better fix (rev. 1.637) is already in tree. Approved by: kan (mentor)
# 159964	26-Jun-2006	babkin	Backed out the change by request from rwatson. PR: kern/14584
# 159927	25-Jun-2006	babkin	The common UID/GID space implementation. It has been discussed on -arch in 1999, and there are changes to the sysctl names compared to PR, according to that discussion. The description is in sys/conf/NOTES. Lines in the GENERIC files are added in commented-out form. I'll attach the test script I've used to PR. PR: kern/14584 Submitted by: babkin
# 159392	08-Jun-2006	kib	Fix the LOR that occurs when MAC is compiled into the kernel and a vnode is destroyed. Reviewed by: rwatson LOR: 189 MFC after: 2 weeks Approved by: kan (mentor)
# 158906	25-May-2006	ups	Do not set B_NOCACHE on buffers when releasing them in flushbuflist(). If B_NOCACHE is set the pages of vm backed buffers will be invalidated. However clean buffers can be backed by dirty VM pages so invalidating them can lead to data loss. Add support for flushing dirty pages in the data invalidation function of some network file systems. This fixes data losses during vnode recycling (and other code paths using invalbuf(,V_SAVE,,*)) for data written using an mmaped file. Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@ Reviewed by: tegge@ MFC after: 7 days
# 158471	12-May-2006	jhb	Remove various bits of conditional Alpha code and fixup a few comments.
# 158151	29-Apr-2006	pjd	vn_start_write()/vn_finished_write() is not needed here, because vn_start_write() is always called earlier in the code path and calling the function recursively may lead to a deadlock. Confirmed by: tegge MFC after: 2 weeks
# 158095	28-Apr-2006	jeff	- Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child buffers to go on the buf daemon's DIRTYGIANT queue. - Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler runs in the context of the buf daemon and may require Giant.
# 157470	04-Apr-2006	jeff	- VFS_LOCK_GIANT when recycling a vnode via getnewvnode. We may be recycling for an unrelated filesystem. I really don't like potentially acquiring giant in the context of a giantless filesystem but there are reasonable objections to removing the recycling from this path. Sponsored by: Isilon Systems, Inc.
# 157345	31-Mar-2006	jeff	- Add an assert to vgone. It is illegal to call vgone without a reference to the vnode. Without a reference the vnode will never be vdestroy'd and the memory will never be reclaimed. Sponsored by: Isilon Systems, Inc.
# 157324	31-Mar-2006	jeff	- Hold a reference from the time vfs_busy starts until vfs_unbusy is called. - vfs_getvfs has to return a reference to prevent the returned mountpoint from changing identities. - Release references acquired via vfs_getvfs. Discussed with: tegge Tested by: kris Sponsored by: Isilon Systems, Inc.
# 157319	31-Mar-2006	jeff	- Add the B_NEEDSGIANT flag which is only set if the vnode that owns a buf requires Giant. It is set in bgetvp and cleared in brelvp. - Create QUEUE_DIRTY_GIANT for dirty buffers that require giant. - In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and only if we think there are buffers in that queue. Sponsored by: Isilon Systems, Inc.
# 156892	19-Mar-2006	jeff	- Correct an assert in vop_rename_pre. fdvp may be locked if it is either the target directory or file. This case should fail in the filesystem anyway and perhaps kern_rename() should catch it. Sponsored by: Isilon Systems, Inc.
# 156451	08-Mar-2006	tegge	Use vn_start_secondary_write() and vn_finished_secondary_write() as a replacement for vn_write_suspend_wait() to better account for secondary write processing. Close race where secondary writes could be started after ffs_sync() returned but before the file system was marked as suspended. Detect if secondary writes or softdep processing occurred during vnode sync loop in ffs_sync() and retry the loop if needed.
# 156225	02-Mar-2006	tegge	Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must be called without any vnode locks held. Remove calls to vn_start_write() and vn_finished_write() in vnode_pager_putpages() and add these calls before the vnode lock is obtained to most of the callers that don't already have them.
# 156223	02-Mar-2006	tegge	Don't try to show marker nodes.
# 156203	02-Mar-2006	jeff	- Move softdep from using a global worklist to per-mount worklists. This has many positive effects including improved smp locking, reducing interdependencies between mounts that can lead to deadlocks, etc. - Add the softdep worklist and various counters to the ufsmnt structure. - Add a mount pointer to the workitem and remove mount pointers from the various structures derived from the workitem as they are now redundant. - Remove the poor-man's semaphore protecting softdep_process_worklist and softdep_flushworklist. Several threads may now process the list simultaneously. - Add softdep_waitidle() to block the thread until all pending dependencies being operated on by other threads have been flushed. - Use softdep_waitidle() in unmount and snapshots to block either operation until the fs is stable. - Remove softdep worklist processing from the syncer and move it into the softdep_flush() thread. This thread processes all softdep mounts once each second and when it is called via the new softdep_speedup() when there is a resource shortage. This removes the softdep hook from the kernel and various hacks in header files to support it. Reviewed by/Discussed with: tegge, truckman, mckusick Tested by: kris
# 155938	23-Feb-2006	jeff	- Release the mount ref once the vnode has been recycled rather than once the last reference is dropped. I forgot that vnodes can stick around for a very long time until processes discover that they are dead. This means that a vnode reference is not sufficient to keep the mount referenced and even more code will be required to ref mount points. Discovered by: kris
# 155901	22-Feb-2006	jeff	- Grab a mnt ref in vfs_busy() before dropping the interlock. This will prevent the mount point from going away while we're waiting on the lock. The ref does not need to persist once we have the lock because the lock prevents the mount point from being unmounted. MFC After: 1 week
# 155386	06-Feb-2006	jeff	- Add a ref count to the mount structure. Sleep for up to 3 seconds in vfs_mount_destroy waiting for this ref to hit 0. We don't print an error if we are rebooting as the root mount always retains some references by init proc. - Acquire a mnt ref for every vnode allocated to a mount point. Drop this ref only once vdestroy() has been called and the mount has been freed. - No longer NULL the v_mount pointer in delmntque() so that we may release the ref after vgone() has been called. This allows us to guarantee that the mount point structure will be valid until the last vnode has lost its last ref. - Fix a few places that rely on checking v_mount to detect recycling. Sponsored by: Isilon Systems, Inc. MFC After: 1 week
# 155161	01-Feb-2006	jeff	- Solve a race where we could lose a call to VOP_INACTIVE. If vget() waiting on a lock held the last usecount ref on a vnode and the lock failed we would not call INACTIVE. Solve this by only holding a holdcnt to prevent the vnode from disappearing while we wait on vn_lock. Other callers may now VOP_INACTIVE while we are waiting on the lock, however this race is acceptable, while losing INACTIVE is not. Discussed with: kan, pjd Tested by: kkenn Sponsored by: Isilon Systems, Inc. MFC After: 1 week
# 154946	28-Jan-2006	kris	Back out r1.653; it turns out that the race (or at least the printf) is actually not hard to trigger, and it can cause a lot of console spam. Approved by: kan
# 154646	21-Jan-2006	rwatson	Convert remaining functions in vfs_subr.c from K&R prototypes to ANSI C prototypes, as the majority of new functions added have been in this style. Changing prototype style now results in gcc noticing that the implementation of vn_pollrecord() has a 'short' argument instead of 'int' as prototyped in vnode.h, so correct that definition. In practice this didn't matter as only poll flags in the lower 16 bits are used. MFC after: 1 week
# 154152	09-Jan-2006	tegge	Add marker vnodes to ensure that all vnodes associated with the mount point are iterated over when using MNT_VNODE_FOREACH. Reviewed by: truckman
# 153859	29-Dec-2005	pjd	Print a warning when we miss vinactive() call, because of race in vget(). The race is very real, but conditions needed for triggering it are rather hard to meet now. When gjournal will be committed (where it is quite easy to trigger) we need to fix it. For now, verify if it is really hard to trigger. Discussed with: kan
# 152254	09-Nov-2005	dwhite	This is a workaround for a complicated issue involving VFS cookies and devfs. The PR and patch have the details. The ultimate fix requires architectural changes and clarifications to the VFS API, but this will prevent the system from panicking when someone does "ls /dev" while running in a shell under the linuxulator. This issue affects HEAD and RELENG_6 only. PR: 88249 Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com> MFC after: 3 days
# 151897	31-Oct-2005	rwatson	Normalize a significant number of kernel malloc type names: - Prefer '_' to ' ', as it results in more easily parsed results in memory monitoring tools such as vmstat. - Remove punctuation that is incompatible with using memory type names as file names, such as '/' characters. - Disambiguate some collisions by adding subsystem prefixes to some memory types. - Generally prefer lower case to upper case. - If the same type is defined in multiple architecture directories, attempt to use the same name in additional cases. Not all instances were caught in this change, so more work is required to finish this conversion. Similar changes are required for UMA zone names.
# 151352	14-Oct-2005	kris	mpsafevm has been stable and defaulted to 1 on sparc64 for over 6 months, so we are ready for mpsafevfs=1 by default on sparc64 too. I have been running this on all my sparc64 machines for over 6 months, and have not encountered MD problems. MFC after: 1 week
# 151252	12-Oct-2005	dds	Move execve's access time update functionality into a new vfs_mark_atime() function, and use the new function for performing efficient atime updates in mmap(). Reviewed by: bde MFC after: 2 weeks
# 150741	30-Sep-2005	truckman	Un-staticize runningbufwakeup() and staticize updateproc. Add a new private thread flag to indicate that the thread should not sleep if runningbufspace is too large. Set this flag on the bufdaemon and syncer threads so that they skip the waitrunningbufspace() call in bufwrite() rather than checking the proc pointer vs. the known proc pointers for these two threads. A way of preventing these threads from being starved for I/O but still placing limits on their outstanding I/O would be desirable. Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from blocking on the runningbufspace check while holding snaplk. This prevents snaplk from being held for an arbitrarily long period of time if runningbufspace is high and greatly reduces the contention for snaplk. The disadvantage is that ffs_copyonwrite() can start a large amount of I/O if there are a large number of snapshots, which could cause a deadlock in other parts of the code. Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace before attempting to grab snaplk so that I/O requests waiting on snaplk are not counted in runningbufspace as being in-progress. Increment runningbufspace again before actually launching the original I/O request. Prior to the above two changes, the system could deadlock if enough I/O requests were blocked by snaplk to prevent runningbufspace from falling below lorunningspace and one of the bawrite() calls in ffs_copyonwrite() blocked in waitrunningbufspace() while holding snaplk. See <http://www.holm.cc/stress/log/cons143.html>
# 150231	16-Sep-2005	tegge	Break out of loop if next buffer pointer has become invalid while flushing current buffer. Reviewed by: kan
# 150062	12-Sep-2005	rwatson	In vfs_kqfilter(), return EINVAL instead of 1 (EPERM) when an unsupported kqueue filter type is requested on a vnode. MFC after: 3 days
# 150047	12-Sep-2005	jkim	use monotonic `time_uptime' instead of `time_second' Approved by: anholt (mentor) Discussed on: arch
# 150020	12-Sep-2005	phk	Introduce vfs_read_dirent() which can help VOP_READDIR() implementations by handling all the cookie stuff.
# 149557	28-Aug-2005	ssouhlal	Fix a typo in vop_rename_pre() where we ended up using vholdl() instead of vhold(), even though the vnode interlock is unlocked. MFC after: 3 days
# 149385	23-Aug-2005	truckman	Back out the removal of LK_NOWAIT from the VOP_LOCK() call in vlrureclaim() in vfs_subr.c 1.636 because waiting for the vnode lock aggravates an existing race condition. It is also undesirable according to the commit log for 1.631. Fix the tiny race condition that remains by rechecking the vnode state after grabbing the vnode lock and grabbing the vnode interlock. Fix the problem of other threads being starved (which 1.636 attempted to fix by removing LK_NOWAIT) by calling uio_yield() periodically in vlrureclaim(). This should be more deterministic than hoping that VOP_LOCK() without LK_NOWAIT will block, which may not happen in this loop. Reviewed by: kan MFC after: 5 days
# 149340	20-Aug-2005	rwatson	Silence "busy" warnings when unmounting devfs at system shutdown. This is a workaround for non-symmetric teardown of the file systems at shutdown with respect to the mount order at boot. The proper long term fix is to properly detach devfs from the root mount before unmounting each, and should be implemented, but since the problem is non-harmful, this temporary band-aid will prevent false positive bug reports and unnecessary error output for 6.0-RELEASE. MFC after: 3 days Tested by: pav, pjd
# 149034	13-Aug-2005	marcel	Make mpsafe_vfs=1 the default on ia64.
# 148922	10-Aug-2005	kan	Do not drop the vnode interlock if vdropl is called on already doomed vnode. vdropl callers expect it to return with interlock still being held. MFC after: 2 days
# 148768	06-Aug-2005	ssouhlal	Holding a vnode doesn't prevent v_mount from disappearing (when the vnode is inactivated), possibly leading to a NULL dereference when checking if the mount wants knotes to be activated in the VOP hooks. So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(), if necessary, and check it when activating knotes. Since the flags are not erased when a vnode is being held, we can safely read them. Reviewed by: kris@ MFC after: 3 days
# 148671	03-Aug-2005	jeff	- Unlock before we call mac_destroy_vnode to prevent a lock order reversal. Found by: trhodes
# 148167	20-Jul-2005	jeff	- Allow vnlru to drop giant if the filesystem does not require it. The vnlru proc is extremely inefficient, potentially iterating over tens of thousands of vnodes without blocking. Dropping Giant allows other threads to preempt us although we should revisit the algorithm to fix the runtime problems especially since this may hold up all vnode allocations. - Remove the LK_NOWAIT from the VOP_LOCK in vlrureclaim. This provides a natural blocking point to help alleviate the situation described above although it may not technically be desirable. - yield after we make a pass on all mount points to prevent us from blocking other threads which require Giant. MFC after: 2 weeks
# 147772	05-Jul-2005	pjd	Fix one "wrong b_bufobj" panic in reassignbuf() by moving VI_UNLOCK(vp) below KASSERT()s, which means there was no real problem here, we just needed better locking for assertions. OK'ed by: jeff Approved by: re (scottl)
# 147730	01-Jul-2005	ssouhlal	Fix the recent panics/LORs/hangs created by my kqueue commit by: - Introducing the possibility of using locks different than mutexes for the knlist locking. In order to do this, we add three arguments to knlist_init() to specify the functions to use to lock, unlock and check if the lock is owned. If these arguments are NULL, we assume mtx_lock, mtx_unlock and mtx_owned, respectively. - Using the vnode lock for the knlist locking, when doing kqueue operations on a vnode. This way, we don't have to lock the vnode while holding a mutex, in filt_vfsread. Reviewed by: jmg Approved by: re (scottl), scottl (mentor override) Pointyhat to: ssouhlal Will be happy: everyone
# 147478	18-Jun-2005	jeff	- Try to catch the wrong bufobj panics a little earlier. I believe they are actually caused by a buf with both VNCLEAN and VNDIRTY set. In the traces it is clear that the buf is removed from the dirty queue while it is actually on the clean queue which leaves the tail pointer set. Assert that both flags are not set in buf_vlist_add and buf_vlist_remove. Sponsored by: Isilon Systems, Inc. Approved by: re (blanket vfs)
# 147407	16-Jun-2005	jeff	- Change holdcnt use around vnode recycling. We now always keep a holdcnt ref while we're calling vgone(). This prevents transient refs from re-adding us to the free list. Previously, a vfree() triggered via vinvalbuf() getting rid of all of a vnode's pages could place a partially destructed vnode on the free list where vtryrecycle() could find it. The first call to vtryrecycle would hang up on the vnode lock, but when it failed it would place a now dead vnode onto the free list, and another call to vtryrecycle() would free an already free vnode. There were many complications of having a zero ref count while freeing which can now go away. - Change vdropl() to release the interlock before returning. All callers now respect this, so vdropl() directly frees VI_DOOMED vnodes once the last ref is dropped. This means that we'll never have VI_DOOMED vnodes on the free list. - Separate v_incr_usecount() into v_incr_usecount(), v_decr_usecount() and v_decr_useonly(). The incr/decr split is so that incr usecount can return with the interlock still held while decr drops the interlock so it can call vdropl() which will potentially free the vnode. The calling function can't drop the lock of an already free'd node. v_decr_useonly() drops a usecount without dropping the hold count. This is done so the usecount reaches zero in vput() before we recycle, however the holdcount is still 1 which prevents any new references from placing the vnode back on the free list. - Fix vnlrureclaim() to vhold the vnode since it doesn't do a vget(). We wouldn't want vnlrureclaim() to bump the usecount since this has different semantics. Also change vnlrureclaim() to do a NOWAIT on the vn_lock. When this function runs we're usually in a desperate situation and we wouldn't want to wait for any specific vnode to be released. - Fix a bunch of misc comments to reflect the new behavior. - Add vhold() and vdrop() to vflush() for the same reasons that we do in vlrureclaim(). Previously we held no reference and a vnode could have been freed while we were waiting on the lock. - Get rid of vlruvp() and vfreehead(). Neither are used. vlruvp() should really be rethought before it's reintroduced. - vgonel() always returns with the vnode locked now and never puts the vnode back on a free list. The vnode will be freed as soon as the last reference is released. Sponsored by: Isilon Systems, Inc. Debugging help from: Kris Kennaway, Peter Holm Approved by: re (blanket vfs)
# 147387	14-Jun-2005	jeff	- In reassignbuf() add many asserts to validate the head and tail pointers of the clean and dirty lists. This is in an attempt to catch the wrong bufobj problem sooner. - In vgonel() don't acquire an extra reference in the active case, the vnode lock and VI_DOOMED protect us from recursively cleaning. - Also in vgonel() clean up some stale comments. Sponsored by: Isilon Systems, Inc. Approved by: re (blanket vfs)
# 147332	13-Jun-2005	jeff	- Don't make vgonel() globally visible, we want to change its prototype anyway and it's not used outside of vfs_subr.c. - Change vgonel() to accept a parameter which determines whether or not we'll put the vnode on the free list when we're done. - Use the new vgonel() parameter rather than VI_DOOMED to signal our intentions in vtryrecycle(). - In vgonel() return if VI_DOOMED is already set, this vnode has already been reclaimed. Sponsored by: Isilon Systems, Inc.
# 147327	13-Jun-2005	jeff	- Add KTR_VFS events to vdestroy, vtruncbuf, vinvalbuf, vfreehead. Sponsored by: Isilon Systems, Inc.
# 147297	11-Jun-2005	jeff	- Assert that we're not in the name cache anymore in vdestroy(). Sponsored by: Isilon Systems, Inc.
# 147290	11-Jun-2005	jeff	- Add KTR_VFS tracing to track the life of vnodes. Eventually KTR_VFS events could be added to cover other interesting details. - Add some VNASSERTs to discover places where we access vnodes after they have been uma_zfree'd before we try to free them again. - Add a few more VNASSERTs to vdestroy() to be certain that the vnode is really unused. Sponsored by: Isilon Systems, Inc.
# 147198	09-Jun-2005	ssouhlal	Allow EVFILT_VNODE events to work on every filesystem type, not just UFS by: - Making the pre and post hooks for the VOP functions work even when DEBUG_VFS_LOCKS is not defined. - Moving the KNOTE activations into the corresponding VOP hooks. - Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct mount that permits filesystems to disable the new behavior. - Creating a default VOP_KQFILTER function: vfs_kqfilter() My benchmarks have not revealed any performance degradation. Reviewed by: jeff, bde Approved by: rwatson, jmg (kqueue changes), grehan (mentor)
# 147113	07-Jun-2005	jeff	- Clear OWEINACT prior to calling VOP_INACTIVE to remove the possibility of a vget causing another call to INACTIVE before we're finished.
# 145953	06-May-2005	cperciva	If we are going to 1. Copy a NULL-terminated string into a fixed-length buffer, and 2. copyout that buffer to userland, we really ought to 0. Zero the entire buffer first. Security: FreeBSD-SA-05:08.kmem
# 145822	03-May-2005	jeff	- A vnode may have made its way onto the free list while it was being vgone'd. We must remove it from the freelist before returning in vtryrecycle() or we may get a duplicate free. Reported by: kkenn
# 145785	02-May-2005	csjp	Since it is not possible for curthread to be NULL in this context, drop the check+initialization for a straight initialization. Also assert that curthread will never be NULL just to be sure. Discussed with: rwatson, peter MFC after: 1 week
# 145767	01-May-2005	jeff	- All buffers should either be clean or dirty. If neither of these flags are set when we attempt to remove a buffer from a queue we should panic. Hopefully this will catch the source of the wrong bufobj panics. Sponsored by: Isilon Systems, Inc.
# 145697	30-Apr-2005	jeff	- In vnlru_free() remove the vnode from the free list before we call vtryrecycle(). We could sometimes get into situations where two threads could try to recycle the same vnode before this. - vtryrecycle() is now responsible for returning the vnode to the free list if it fails and someone else hasn't done it. - Make a new function vfreehead() which moves a vnode to the head of the free list and use it in vgone() to clean up that code a bit. Sponsored by: Isilon Systems, Inc. Reported by: pho, kkenn
# 145590	27-Apr-2005	jeff	- Don't vgonel() via vgone() or vrecycle() if the vnode is already doomed. This fixes forced unmounts via nullfs. Reported by: kkenn Sponsored by: Isilon Systems, Inc.
# 145588	27-Apr-2005	jeff	- Stop setting vxthread, we've asserted that it was useless for several weeks now.
# 145385	22-Apr-2005	jeff	- Disable code which allows getnewvnode() to fail. Many ffs_vget() callers do not correctly deal with failures. This presently risks deadlock problems if dependency processing is held up by failures to allocate a vnode, however, this is better than the situation with the failures. Sponsored by: Isilon Systems, Inc.
# 145249	18-Apr-2005	phk	Initialize mountlist_mtx with an MTX_SYSINIT(), we need it to be ready earlier.
# 145005	13-Apr-2005	jeff	- Change vop_lookup_post assertions to reflect recent vfs_lookup changes. Sponsored by: Isilon Systems, Inc.
# 144908	11-Apr-2005	jeff	- Enable ASSERT_VOP_ELOCKED and assert_vop_elocked() now that vnode_if.awk uses it. Sponsored by: Isilon Systems, Inc.
# 144900	11-Apr-2005	jeff	- Change the VOP_LOCK UPGRADE in vput() to do a LK_NOWAIT to avoid a potential lock order reversal. Also, don't unlock the vnode if this fails, lockmgr has already unlocked it for us. - Restructure vget() now that vn_lock() does all of VI_DOOMED checking for us and also handles the case where there is no real lock type. - If VI_OWEINACT is set, we need to upgrade the lock request to EXCLUSIVE so that we can call inactive. It's not legal to vget a vnode that hasn't had INACTIVE called yet. Sponsored by: Isilon Systems, Inc.
# 144704	06-Apr-2005	jeff	- Assert that the bufobj matches in flushbuflists. I still haven't gotten to root cause on exactly how this happens. - If the assert is disabled, we presently try to handle this case, but the BUF_UNLOCK was missing. Thus, if this condition ever hit we would leak a buf lock. Many thanks to Peter Holm for all his help in finding this bug. He really put more effort into it than I did.
# 144661	05-Apr-2005	jeff	- Move NDFREE() from vfs_subr to vfs_lookup where namei() is.
# 144625	04-Apr-2005	jeff	- Add a missing unlock of the vnode_free_list_mtx. Spotted by: Antoine Brodin
# 144624	04-Apr-2005	jeff	- Instead of waiting forever to get a vnode in getnewvnode() wait for one to become available for one second and then return ENFILE. We can run out of vnodes, and there must be a hard limit because without one we can quickly run out of KVA on x86. Presently the system can deadlock if there are maxvnodes directories in the namecache. The original 4.x BSD behavior was to return ENFILE if we reached the max, but 4.x BSD did not have the vnlru proc so it was less profitable to wait.
# 144374	31-Mar-2005	jeff	- Disable vfs shared locks by default. They must be specifically enabled on filesystems which safely support them. It appears that many network filesystems specifically are not shared lock safe. Sponsored by: Isilon Systems, Inc.
# 144367	31-Mar-2005	jeff	- LK_NOPAUSE is a nop now. Sponsored by: Isilon Systems, Inc.
# 144319	30-Mar-2005	das	Eliminate v_id and v_ddid. The name cache now holds references to vnodes whose names it caches, so we no longer need a `generation number' to tell us if a referenced vnode is invalid. Replace the use of the parent's v_id in the hash function with the address of the parent vnode. Tested by: Peter Holm Glanced at by: jeff, phk
# 144284	29-Mar-2005	jeff	- Don't clear OWEINACT in vbusy(), we still owe an inactive call if someone vhold()s us. - Avoid an extra mutex acquire and release in the common case of vgonel() by checking for OWEINACT at the start of the function. - Fix the case where we set OWEINACT in vput(). LK_EXCLUPGRADE drops our shared lock if it fails. Sponsored by: Isilon Systems, Inc.
# 144283	29-Mar-2005	jeff	- Don't initialize v_dd here, let cache_purge() do it for us. Sponsored by: Isilon Systems, Inc.
# 144219	28-Mar-2005	jeff	- Move code that should probably be an assert above the main body of vrele so that we can decrease the indentation of the real work and make things slightly more clear. Sponsored by: Isilon Systems, Inc.
# 144204	28-Mar-2005	jeff	- Adjust asserts in vop_lookup_post() to match the new post PDIRUNLOCK vfs. Sponsored by: Isilon Systems, Inc.
# 144173	27-Mar-2005	phk	Remove another ';' after if(). Also spotted by: bz
# 144172	27-Mar-2005	phk	Remove extra ; at end of if(). Found by: bz
# 144092	25-Mar-2005	jeff	- Don't recycle vnodes anymore. Free them once they are dead. getnewvnode now always allocates a new vnode. - Define a new function, vnlru_free, which frees vnodes from the free list. It takes as a parameter the number of vnodes to free, which is wantfreevnodes - freevnodes when called from vnlru_proc or 1 when called from getnewvnode(). For now, getnewvnode() still tries to reclaim a free vnode before creating a new one when we are near the limit. - Define a function, vdestroy, which handles the actual release of memory and teardown of locks, etc. This could become a uma_dtor() routine. - Get rid of minvnodes. Now wantfreevnodes is 1/4th the max vnodes. This keeps more unreferenced vnodes around so that files which have only been stat'd are less likely to be kicked out of the system before we have a chance to read them, etc. These vnodes may still be freed via the normal vnlru_proc() routines which may some day become a real lru.
# 144055	24-Mar-2005	jeff	- Pass LK_EXCLUSIVE to VFS_ROOT() to satisfy the new flags argument. For now, all calls to VFS_ROOT() should still acquire exclusive locks. Sponsored by: Isilon Systems, Inc.
# 144051	24-Mar-2005	jeff	- If vput() is called with a shared lock it must upgrade to an exclusive before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path because we may violate some lock order with another locked vnode if we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the vnode with VI_OWEINACT. This case should be very rare. - Clear VI_OWEINACT in vinactive() and vbusy(). - If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well. Sponsored by: Isilon Systems, Inc.
# 143652	15-Mar-2005	jeff	- Now that there are no external users of vfree() make it static. - Move VSHOULDBUSY, VSHOULDFREE, and VTRYRECYCLE into vfs_subr.c so no one else attempts to grow a dependency on them. - Now that objects with pages hold the vnode we don't have to do unlocked checks for the page count in the vm object in VSHOULDFREE. These three macros could simply check for holdcnt state transitions to determine whether the vnode is on the free list already, but the extra safety the flag affords us is probably worth the minimal cost. - The leafonly sysctl and code have been dead for several years now, remove the sysctl and the code that employed it from vtryrecycle(). - vtryrecycle() also no longer has to check the object's page count as the object holds the vnode until it reaches 0. Sponsored by: Isilon Systems, Inc.
# 143640	15-Mar-2005	jeff	- Expose vholdl() so it may be used outside of vfs_subr.c
# 143560	14-Mar-2005	jeff	- Increment the holdcnt once for each usecount reference. This allows us to use only the holdcnt to determine whether a vnode may be recycled, simplifying the V* macros as well as vtryrecycle(), etc. Sponsored by: Isilon Systems, Inc.
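The counting rule in r143560 can be sketched in userland C. This is only an illustration under stated assumptions: `struct sketch_vnode` and every `sketch_*` function are hypothetical stand-ins, not the kernel's vref()/vrele()/vhold()/vdrop(), and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two counters kept in struct vnode. */
struct sketch_vnode {
	int v_usecount;		/* active users */
	int v_holdcnt;		/* holds, including one per use reference */
};

/* Taking a use reference also takes a hold. */
static void
sketch_vref(struct sketch_vnode *vp)
{
	vp->v_usecount++;
	vp->v_holdcnt++;
}

/* Dropping a use reference drops the matching hold. */
static void
sketch_vrele(struct sketch_vnode *vp)
{
	assert(vp->v_usecount > 0);
	vp->v_usecount--;
	vp->v_holdcnt--;
}

/* A plain hold keeps the vnode from being recycled without using it. */
static void
sketch_vhold(struct sketch_vnode *vp)
{
	vp->v_holdcnt++;
}

static void
sketch_vdrop(struct sketch_vnode *vp)
{
	assert(vp->v_holdcnt > 0);
	vp->v_holdcnt--;
}

/* Because every use reference is also a hold, one counter decides. */
static bool
sketch_can_recycle(const struct sketch_vnode *vp)
{
	return (vp->v_holdcnt == 0);
}
```

With this invariant the recycle check never needs to look at the use count separately, which is the simplification the commit message describes.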
# 143557	14-Mar-2005	jeff	- We do not have to check the object's ref_count in VSHOULDFREE or vtryrecycle(). All obj refs also ref the vnode. - Consistently use v_incr_usecount() to increment the usecount. This will be more important later. Sponsored by: Isilon Systems, Inc.
# 143552	14-Mar-2005	jeff	- Slightly rearrange vrele() to move the common case in one indentation level. Sponsored by: Isilon Systems, Inc.
# 143551	14-Mar-2005	jeff	- Rework vget() so we drop the usecount in two failure cases that were missed by my last commit. Sponsored by: Isilon Systems, Inc.
# 143497	13-Mar-2005	jeff	- Remove vx_lock, vx_unlock, vx_wait, etc. - Add a vn_start_write/vn_finished_write around vlrureclaim so we don't do writing ops without suspending. This could suspend the vlruproc which should not be a problem under normal circumstances. - Manually implement VMIGHTFREE in vlrureclaim as this was the only instance where it was used. - Acquire a lock before calling vgone() as it now requires it. - Move the acquisition of the vnode interlock from vtryrecycle() to getnewvnode() so that if it fails we don't drop and reacquire the vnode_free_list_mtx. - Check for a usecount or holdcount at the end of vtryrecycle() in case someone grabbed a ref while we were recycling. Abort the recycle, and on the final ref drop this vnode will be placed on the head of the free list. - Move the redundant VOP_INACTIVE protection code into the local vinactive() routine to avoid code bloat. - Keep the vnode lock held across calls to vgone() in several places. - vgonel() no longer uses XLOCK, instead callers must hold an exclusive vnode lock. The VI_DOOMED flag is set to allow other threads to detect a vnode which is no longer valid. This flag is set until the last reference is gone, and there are no chances for a new ref. vgonel() holds this lock across the entire function, which greatly simplifies logic. - Only vfree() in one place in vgone() not three. - Adjust vget() to check the VI_DOOMED flag prior to waiting on the lock in the LK_NOWAIT case. In other cases, check after we have slept and acquired an exclusive lock. This will simulate the old vx_wait() behavior. Sponsored by: Isilon Systems, Inc.
# 142296	23-Feb-2005	jeff	- Enable SMP VFS by default on current. More users are needed to turn up any remaining bugs. Anyone inconvenienced by this can still disable it in the loader. Sponsored by: Isilon Systems, Inc.
# 142265	23-Feb-2005	jeff	- Only the xlock holder should be calling VOP_LOCK on a vp once VI_XLOCK has been set. Assert that this is the case so that we catch filesystems who are using naked VOP_LOCKs in illegal cases. Sponsored by: Isilon Systems, Inc.
# 142264	22-Feb-2005	jeff	- Add a check for xlock in vop_lock_assert. Presently the xlock is considered to be as good as an exclusive lock, although there is still a possibility of someone acquiring a VOP LOCK while xlock is held. Sponsored by: Isilon Systems, Inc.
# 142252	22-Feb-2005	phk	Zero the v_un container field to make sure everything is gone.
# 142242	22-Feb-2005	phk	Reap more benefits from DEVFS: List devfs_dirents rather than vnodes off their shared struct cdev, this saves a pointer field in the vnode at the expense of a field in the devfs_dirent. There are often 100 times more vnodes so this is a bargain. In addition it makes it harder for people to try to do stupid things like "finding the vnode from cdev". Since DEVFS handles all VCHR nodes now, we can do the vnode related cleanup in devfs_reclaim() instead of in dev_rel() and vgonel(). Similarly, we can do the struct cdev related cleanup in dev_rel() instead of devfs_reclaim(). rename idestroy_dev() to destroy_devl() for consistency. Add LIST_ENTRY de_alias to struct devfs_dirent. Remove v_specnext from struct vnode. Change si_hlist to si_alist in struct cdev. String new devfs vnodes' devfs_dirent on si_alist when we create them and take them off in devfs_reclaim(). Fix devfs_revoke() accordingly. Also don't clear fields devfs_reclaim() will clear when called from vgone(); Let devfs_reclaim() call dev_rel() instead of vgonel(). Move the usecount tracking from dev_rel() to devfs_reclaim(), and let dev_rel() take a struct cdev argument instead of vnode. Destroy SI_CHEAPCLONE devices in dev_rel() (instead of devfs_reclaim()) when they are no longer used. (This should maybe happen in devfs_close() instead.)
# 142225	22-Feb-2005	phk	Remove vfinddev(), it is generally bogus when faced with jails and chroot and has no legitimate use(r)s in the tree.
# 142079	19-Feb-2005	phk	Try to unbreak the vnode locking around vop_reclaim() (based mostly on patch from kan@). Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on close. This is not yet a generally safe function, but for this very specific use it is safe. This solves the problem with buffers not being flushed by unmount or after failed mount attempts.
# 142042	18-Feb-2005	phk	Make sure to drop the VI_LOCK in vgonel(); Spotted by: Taku YAMAMOTO <taku@tackymt.homeip.net>
# 142011	17-Feb-2005	phk	Introduce vx_wait{l}() and use it instead of home-rolled versions.
# 142010	17-Feb-2005	phk	Convert KASSERTS to VNASSERTS
# 141637	10-Feb-2005	phk	Make various vnode related functions static
# 141605	10-Feb-2005	phk	Don't pass NULL to vprint()
# 141547	08-Feb-2005	jeff	- Add a new assert in the getnewvnode(). Assert that the usecount is still 0 to detect getnewvnode() races. - Add the vnode address to a few panics near by to help in debugging. Sponsored by: Isilon Systems, Inc.
# 141450	07-Feb-2005	phk	Access vmobject via the bufobj instead of the vnode
# 141435	07-Feb-2005	phk	Don't call VOP_DESTROYVOBJECT(), trust that VOP_RECLAIM() did what was necessary.
# 140936	28-Jan-2005	phk	Remove unused argument to vrecycle()
# 140935	28-Jan-2005	phk	Integrate vclean() into vgonel(). Various associated polishing.
# 140934	28-Jan-2005	phk	Remove register keyword
# 140782	25-Jan-2005	phk	Don't use VOP_GETVOBJECT, use vp->v_object directly.
# 140772	24-Jan-2005	phk	Eliminate the constant flags argument to vclean()
# 140739	24-Jan-2005	phk	Change vprint() to vn_printf() which takes varargs. Add #define for vprint() to call vn_printf().
# 140734	24-Jan-2005	phk	Kill the VV_OBJBUF and test the v_object for NULL instead.
# 140720	24-Jan-2005	jeff	- Add the tunable and sysctl for the mpsafevfs. It currently defaults to off. - Protect access to mnt_kern_flag with the mountpoint mutex. - Remove some KASSERTs which are not legal checks without the appropriate locks held. - Use VCANRECYCLE() rather than rolling several slightly different checks together. - Return from vtryrecycle() with a recycled vnode rather than a locked vnode. This simplifies some locking. - Remove several GIANT_REQUIRED lines. - Add a few KASSERTs to help with INACT debugging. Sponsored By: Isilon Systems, Inc.
# 140360	16-Jan-2005	phk	Fix a bug I introduced in 1.561 which has caused considerable filesystem unhappiness lately. As far as I can tell, no files that have made it safely to disk have been endangered, but stuff in transit has been in peril. Pointy hat: phk
# 140220	14-Jan-2005	phk	Eliminate unused and unnecessary "cred" argument from vinvalbuf()
# 140181	13-Jan-2005	phk	Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT() directly.
# 140056	11-Jan-2005	phk	Add BO_SYNC() and add a default which uses the secret vnode pointer and VOP_FSYNC() for now.
# 140053	11-Jan-2005	phk	More vnode -> bufobj migration.
# 140052	11-Jan-2005	phk	Give flushbuflist() a struct bufv as first argument and avoid home-rolling TAILQ_FOREACH_SAFE(). Lose the error pointer argument and return any errors the normal way. Return EAGAIN for the case where more work needs to be done.
# 140048	11-Jan-2005	phk	Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC(). I'm not sure why a credential was added to these in the first place, it is not used anywhere and it doesn't make much sense: The credentials for syncing a file (ability to write to the file) should be checked at the system call level. Credentials for syncing one or more filesystems ("none") should be checked at the system call level as well. If the filesystem implementation needs a particular credential to carry out the syncing it would logically have to be the cached mount credential, or a credential cached along with any delayed write data. Discussed with: rwatson
# 139804	06-Jan-2005	imp	/* -> /*- for copyright notices, minor format tweaks as necessary
# 139665	04-Jan-2005	phk	Since we do not support forceful unmount of DEVFS we can do away with the partially implemented vnode-readoption code in vgonechrl().
# 139085	20-Dec-2004	phk	We can only ever get to vgonechrl() from a devfs vnode, so we do not need to reassign the vp->v_op to devfs_specops, we know that is the value already. Make devfs_specops private to devfs.
# 138509	07-Dec-2004	phk	The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly split the conversion of the remaining three filesystems out from the root mounting changes, so in one go: cd9660: Convert to nmount. Add omount compat shims. Remove dedicated rootfs mounting code. Use vfs_mountedfrom() Rely on vfs_mount.c calling VFS_STATFS() nfs(client): Convert to nmount (the simple way, mount_nfs(8) is still necessary). Add omount compat shims. Drop COMPAT_PRELITE2 mount arg compatibility. ffs: Convert to nmount. Add omount compat shims. Remove dedicated rootfs mounting code. Use vfs_mountedfrom() Rely on vfs_mount.c calling VFS_STATFS() Remove vfs_omount() method, all filesystems are now converted. Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem task, and they all do it now. Change rootmounting to use DEVFS trampoline: vfs_mount.c: Mount devfs on /. Devfs needs no 'from' so this is clean. symlink /dev to /. This makes it possible to lookup /dev/foo. Mount "real" root filesystem on /. Surgically move the devfs mountpoint from under the real root filesystem onto /dev in the real root filesystem. Remove now unnecessary getdiskbyname(). kern_init.c: Don't do devfs mounting and rootvnode assignment here, it was already handled by vfs_mount.c. Remove now unused bdevvp(), addaliasu() and addalias(). Put the few necessary lines in devfs where they belong. This eliminates the second-last source of bogo vnodes, leaving only the lemming-syncer. Remove rootdev variable, it doesn't give meaning in a global context and was not trustworthy anyway. Correct information is provided by statfs(/).
# 138344	03-Dec-2004	phk	Improve vprint() a little bit: break long lines, reduce indent and tell if the VI_LOCK() is held.
# 138290	01-Dec-2004	phk	Back when VOP_* was introduced, we did not have new-style struct initializations but we did have lofty goals and big ideals. Adjust to more contemporary circumstances and gain type checking. Replace the entire vop_t frobbing thing with properly typed structures. The only casualty is that we can not add a new VOP_ method with a loadable module. History has not given us reason to believe this would ever be feasible in the first place. Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc. Give coda correct prototypes and function definitions for all vop_()s. Generate a bit more data from the vnode_if.src file: a struct vop_vector and prototype typedefs for all vop methods. Add a new vop_bypass() and make vop_default be a pointer to another struct vop_vector. Remove a lot of vfs_init since vop_vector is ready to use from the compiler. Cast various vop_mumble() to void * with uppercase name, for instance VOP_PANIC, VOP_NULL etc. Implement VCALL() by making vdesc_offset the offsetof() the relevant function pointer in vop_vector. This is disgusting but since the code is generated by a script comparatively safe. The alternative for nullfs etc. would be much worse. Fix up all vnode method vectors to remove casts so they become typesafe. (The bulk of this is generated by scripts)
# 137721	15-Nov-2004	phk	Move pbgetvp() and pbrelvp() to vm_pager.c with the rest of the pbuf stuff.
# 137690	14-Nov-2004	phk	Move the bit of the syncer which deals with vnodes into a separate function.
# 137680	13-Nov-2004	phk	Eliminate vop_revoke() function now that devfs_revoke() does the entire job.
# 137508	10-Nov-2004	phk	Slim vnodes by another four bytes by eliminating the (now) unused field v_cachedid.
# 137506	10-Nov-2004	phk	Remove vn_todev()
# 137483	09-Nov-2004	phk	Remove vnode->v_cachedfs. It was only used for the highly dangerous "export all vnodes with a sysctl" function.
# 137186	04-Nov-2004	phk	Remove buf->b_dev field.
# 137170	03-Nov-2004	phk	Always initialize bo_private along with bo_ops in getnewvnode(). Spotted by: tegge
# 137050	29-Oct-2004	phk	Lose vfs_mountedon()
# 137033	29-Oct-2004	phk	Give the bufobj a private __bo_vnode for now to keep the syncer floating [1] At some point later the syncer will unlearn about vnodes and the filesystems method called by the syncer will know enough about what's in bo_private to do the right thing. [1] Ok, I know, but I couldn't resist the pun.
# 136992	27-Oct-2004	phk	Move the syncer linkage from vnode to bufobj. This is not quite a perfect separation: the syncer still think it knows that everything is a vnode.
# 136966	26-Oct-2004	phk	Put the I/O block size in bufobj->bo_bsize. We keep si_bsize_phys around for now as that is the simplest way to pull the number out of disk device drivers in devfs_open(). The correct solution would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot when filesystems sit on GEOM, so don't bother for now.
# 136943	25-Oct-2004	phk	Lose the v_dirty* and v_clean* alias macros. Check the count field where we just want to know the full/empty state, rather than using TAILQ_EMPTY() or TAILQ_FIRST().
# 136941	25-Oct-2004	phk	Remove vnode->v_bsize. This was a dead-end.
# 136938	25-Oct-2004	phk	Collapse vnode->v_object and buf->b_object into bufobj->bo_object.
# 136927	24-Oct-2004	phk	Move the buffer method vector (buf->b_op) to the bufobj. Extend it with a strategy method. Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY song and dance. Rename ibwrite to bufwrite(). Move the two NFS buf_ops to more sensible places, add bufstrategy to them. Add inlines for bwrite() and bstrategy() which calls through buf->b_bufobj->b_ops->b_{write,strategy}(). Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
# 136771	22-Oct-2004	rwatson	When MAC is enabled, warn if getnewvnode() is asked to produce a vnode without a mountpoint. In this scenario, there's no useful source for a label on the vnode, since we can't query the mountpoint for the labeling strategy or default label.
# 136770	22-Oct-2004	phk	Alas, poor SPECFS! -- I knew him, Horatio; A filesystem of infinite jest, of most excellent fancy: he hath taught me lessons a thousand times; and now, how abhorred in my imagination it is! my gorge rises at it. Here were those hacks that I have curs'd I know not how oft. Where be your kludges now? your workarounds? your layering violations, that were wont to set the table on a roar? Move the skeleton of specfs into devfs where it now belongs and bury the rest.
# 136767	22-Oct-2004	phk	Add b_bufobj to struct buf which eventually will eliminate the need for b_vp. Initialize b_bufobj for all buffers. Make incore() and gbincore() take a bufobj instead of a vnode. Make inmem() local to vfs_bio.c Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj) also VI_MTX() to BO_MTX(), Make buf_vlist_add() take a bufobj instead of a vnode. Eliminate other uses of bp->b_vp where bp->b_bufobj will do. Various minor polishing: remove "register", turn panic into KASSERT, use new function declarations, TAILQ_FOREACH_SAFE() etc.
# 136751	21-Oct-2004	phk	Move the VI_BWAIT flag into a bo_flag element of bufobj and call it BO_WWAIT. Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write count on a bufobj. Bufobj_wdrop() replaces vwakeup(). Use these functions in all relevant places except in ffs_softdep.c where the use of interlocked_sleep() makes this impossible. Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
# 136750	21-Oct-2004	phk	Add BO_* macros parallel to VI_* macros for manipulating the bo_mtx. Initialize the bo_mtx when we allocate a vnode in getnewvnode(). For now we point to the vnode's interlock mutex, which retains the exact same locking semantics. Move v_numoutput from vnode to bufobj. Add renaming macro to postpone code sweep.
# 136749	21-Oct-2004	phk	Polish vtruncbuf() to improve readability and style a bit.
# 136747	21-Oct-2004	phk	Simplify buf_vlist_remove(). Now that we have encapsulated the splaytree related information into a structure we can eliminate the half of this function.
# 136180	06-Oct-2004	grog	vtryrecycle: Don't rely on type VBAD alone to mean that we don't need to clean the vnode. If v_data is set, we still need to clean it. This code change should catch all incidents of the previous commit (INVARIANTS only).
# 136179	06-Oct-2004	grog	getnewvnode: Weaken the panic "cleaned vnode isn't" to a warning. Discussion: this panic (or warning) only occurs when the kernel is compiled with INVARIANTS. Otherwise the problem (which means that the vp->v_data field isn't NULL, and represents a coding error and possibly a memory leak) is silently ignored by setting it to NULL later on. Panicking here isn't very helpful: by this time, we can only find the symptoms. The panic occurs long after the reason for "not cleaning" has been forgotten; in the case in point, it was the result of severe file system corruption which left the v_type field set to VBAD. That issue will be addressed by a separate commit.
# 136014	01-Oct-2004	phk	Fix a LOR relating to freeing cdevs.
# 135708	24-Sep-2004	phk	Hold dev_lock and check for NULL devsw pointer when we determine if a vnode is a disk.
# 135600	23-Sep-2004	phk	Do not refcount the cdevsw, but rather maintain a cdev->si_threadcount of the number of threads which are inside whatever is behind the cdevsw for this particular cdev. Make the device mutex visible through dev_lock() and dev_unlock(). We may want finer granularity later. Replace spechash_mtx use with dev_lock()/dev_unlock().
# 135280	15-Sep-2004	phk	Remove unused B_WRITEINPROG flag
# 134899	07-Sep-2004	phk	Create simple function init_va_filerev() for initializing a va_filerev field. Replace three instances of longhaired initialization va_filerev fields. Added XXX comment wondering why we don't use random bits instead of uptime of the system for this purpose.
# 134091	20-Aug-2004	truckman	Don't attempt to trigger the syncer thread final sync code in the shutdown_pre_sync state if the RB_NOSYNC flag is set. This is the likely cause of hangs after a system panic that are keeping crash dumps from being done. This is a MFC candidate for RELENG_5. MFC after: 3 days
# 133826	16-Aug-2004	obrien	s/MAX_SAFE_MAXVNODES/MAXVNODES_MAX/g
# 133741	15-Aug-2004	jmg	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowledge of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using its filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to acquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
# 133462	11-Aug-2004	rwatson	In v_addpollinfo(), we allocate storage to back vp->v_pollinfo. However, we may sleep when doing so; check that we didn't race with another thread allocating storage for the vnode after allocation is made to a local pointer, and only update the vnode pointer if it's still NULL. Otherwise, accept that another thread got there first, and release the local storage. Discussed with: jmg
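The allocate-then-recheck pattern described in r133462 can be sketched in userland C with a pthread mutex standing in for the vnode interlock. All names here (`struct sketch_node`, `struct sketch_pollinfo`, `sketch_addpollinfo()`) are hypothetical; this is not the code of v_addpollinfo() itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct vpollinfo and struct vnode. */
struct sketch_pollinfo {
	int events;
};

struct sketch_node {
	pthread_mutex_t lock;		/* plays the role of the interlock */
	struct sketch_pollinfo *pollinfo;
};

/*
 * Allocate into a local pointer without holding the lock (the allocation
 * may block), then install it only if the field is still NULL. If another
 * thread got there first, keep its storage and free our own.
 */
static struct sketch_pollinfo *
sketch_addpollinfo(struct sketch_node *np)
{
	struct sketch_pollinfo *vi, *cur;

	assert(np != NULL);
	vi = calloc(1, sizeof(*vi));
	if (vi == NULL)
		return (NULL);

	pthread_mutex_lock(&np->lock);
	if (np->pollinfo == NULL) {
		np->pollinfo = vi;	/* we won: publish our allocation */
		vi = NULL;
	}
	cur = np->pollinfo;
	pthread_mutex_unlock(&np->lock);

	free(vi);			/* no-op when we won the race */
	return (cur);
}
```

Calling the function twice on the same node returns the same pointer both times; the second call frees its own allocation.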
# 133418	10-Aug-2004	njl	Skip the syncing disks loop if there are no dirty buffers. Remove a variable used to flag the initial printf. Submitted by: truckman (earlier version)
# 133038	02-Aug-2004	obrien	Put a cap on the auto-tuning of kern.maxvnodes. Cap value chosen by: scottl
# 132866	30-Jul-2004	njl	Minor message cleanup.
# 132710	27-Jul-2004	phk	Convert the vfsconf list to a TAILQ. Introduce vfs_byname() function to find things on it. Staticize vfs_nmount() function under the name vfs_donmount(). Various cleanups.
# 132653	26-Jul-2004	cperciva	Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb
# 132640	25-Jul-2004	phk	Eliminate unused second argument to reassignbuf() and simplify it accordingly.
# 132489	21-Jul-2004	alfred	put several of the options for DEBUG_VFS_LOCKS under control of sysctls.
# 132197	15-Jul-2004	alfred	Cleanup shutdown output.
# 132177	15-Jul-2004	alfred	Tidy up system shutdown.
# 132023	12-Jul-2004	alfred	Make VFS_ROOT() and vflush() take a thread argument. This is to allow filesystems to decide based on the passed thread which vnode to return. Several filesystems used curthread, they now use the passed thread.
# 132007	12-Jul-2004	alfred	Dump the actual bad values when this assertion is tripped.
# 131934	10-Jul-2004	marcel	Update for the KDB framework: o Call kdb_enter() instead of Debugger().
# 131784	08-Jul-2004	alfred	fixup sysctl by fsid node
# 131695	06-Jul-2004	alfred	Introduce vfs_suser(), used to test if a user should have special privs for a mount.
# 131691	06-Jul-2004	alfred	NFS mobility PHASE I, II & III (phase VI, and V pending): Rebind the client socket when we experience a timeout. This fixes the case where our IP changes for some reason. Signal a VFS event when NFS transitions from up to down and vice versa. Add a placeholder vfs_sysctl where we will put status reporting shortly. Also: Make down NFS mounts return EIO instead of EINTR when there is a soft timeout or force unmount in progress.
# 131650	05-Jul-2004	truckman	Unconditionally set last_work_seen while in the SYNCER_RUNNING state so that last_work_seen has a reasonable value at the transition to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened to be zero at the time. Initialize last_work_seen to zero as a safety measure in case the syncer never ran in the SYNCER_RUNNING state. Tested by: phk
# 131602	05-Jul-2004	truckman	Rework syncer termination code: Speed up the syncer when shutting down by sleeping for a shorter period of time instead of cranking up rushjob and using the normal one second sleep. Skip empty worklist slots when shutting down to avoid lengthy intervals of inactivity. Give I/O more time to complete between steps by not speeding the syncer quite as much. Terminate the syncer after one full pass through the worklist plus one second with the worklist containing nothing but syncer vnodes. Print an indication of shutdown progress to the console. Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist to be monitored.
# 131598	04-Jul-2004	phk	Give synthetic root filesystem device vnodes a v_bsize of DEV_BSIZE.
# 131593	04-Jul-2004	alfred	Pass the operation in with the fsidctl. Remove some fsidctls that we will not be using. Correct prototypes for fs sysctls.
# 131590	04-Jul-2004	phk	Make the last commit handle non-phk root devices better.
# 131565	04-Jul-2004	phk	Blocksize for I/O should be a property of the vnode and not found by groping around in the vnodes surroundings when we allocate a block. Assign a blocksize when we create a vnode, and yell a warning (and ignore it) if we got the wrong size. Please email all such warnings to me.
# 131562	04-Jul-2004	alfred	Introduce a new kevent filter. EVFILT_FS that will be used to signal generic filesystem events to userspace. Currently only mount and unmount of filesystems are signalled. Soon to be added, up/down status of NFS. Introduce a sysctl node used to route requests to/from filesystems based on filesystem ids. Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/ entrypoint by the sysctl code to change individual filesystems.
# 131559	04-Jul-2004	alfred	Revision 1.496 would not boot on my system due to ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) -> insmntqueue(vp, mp = NULL) -> KASSERT -> panic Make getnewvnode() only call insmntqueue() if the mountpoint parameter is not NULL.
# 131551	04-Jul-2004	phk	When we traverse the vnodes on a mountpoint we need to look out for our cached 'next vnode' being removed from this mountpoint. If we find that it was recycled, we restart our traversal from the start of the list. Code to do that is in all local disk filesystems (and a few other places) and looks roughly like this: MNT_ILOCK(mp); loop: for (vp = TAILQ_FIRST(&mp...); (vp = nvp) != NULL; nvp = TAILQ_NEXT(vp,...)) { if (vp->v_mount != mp) goto loop; MNT_IUNLOCK(mp); ... MNT_ILOCK(mp); } MNT_IUNLOCK(mp); The code which takes vnodes off a mountpoint looks like this: MNT_ILOCK(vp->v_mount); ... TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes); ... MNT_IUNLOCK(vp->v_mount); ... vp->v_mount = something; (Take a moment and try to spot the locking error before you read on.) On a SMP system, one CPU could have removed nvp from our mountlist but not yet gotten to assign a new value to vp->v_mount while another CPU simultaneously gets to the top of the traversal loop where it finds that (vp->v_mount != mp) is not true despite the fact that the vnode has indeed been removed from our mountpoint. Fix: Introduce the macro MNT_VNODE_FOREACH() to traverse the list of vnodes on a mountpoint while taking into account that vnodes may be removed from the list as we go. This saves approx 65 lines of duplicated code. Split the insmntque() which potentially moves a vnode from one mount point to another into delmntque() and insmntque() which does just what the names say. Fix delmntque() to set vp->v_mount to NULL while holding the mountpoint lock.
# 131428	01-Jul-2004	truckman	When shutting down the syncer kernel thread, first tell it to run faster and iterate over its work list a few times in an attempt to empty the work list before the syncer terminates. This leaves fewer dirty blocks to be written at the "syncing disks" stage and keeps the "giving up on N buffers" problem from being triggered by the presence of a large soft updates work list at system shutdown time. The downside is that the syncer takes noticeably longer to terminate. Tested by: "Arjan van Leeuwen" <avleeuwen AT piwebs DOT com> Approved by: mckusick
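The delmntque() part of the fix in r131551 can be sketched in userland C: the back pointer is cleared while the list lock is still held, so a traversal that rechecks `v_mount` under the same lock can never see a removed vnode that still claims to belong to the mount. The `sketch_*` names and the pthread mutex are assumptions for illustration; this is not the kernel's insmntque()/delmntque() or MNT_VNODE_FOREACH().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

struct sketch_mount;

/* Hypothetical stand-ins for struct vnode and struct mount. */
struct sketch_vnode {
	struct sketch_mount *v_mount;
	TAILQ_ENTRY(sketch_vnode) v_link;
};

TAILQ_HEAD(sketch_vnodelist, sketch_vnode);

struct sketch_mount {
	pthread_mutex_t mnt_lock;	/* plays the role of MNT_ILOCK */
	struct sketch_vnodelist mnt_vnodes;
};

static void
sketch_mount_init(struct sketch_mount *mp)
{
	pthread_mutex_init(&mp->mnt_lock, NULL);
	TAILQ_INIT(&mp->mnt_vnodes);
}

static void
sketch_insmntque(struct sketch_vnode *vp, struct sketch_mount *mp)
{
	pthread_mutex_lock(&mp->mnt_lock);
	vp->v_mount = mp;
	TAILQ_INSERT_TAIL(&mp->mnt_vnodes, vp, v_link);
	pthread_mutex_unlock(&mp->mnt_lock);
}

static void
sketch_delmntque(struct sketch_vnode *vp)
{
	struct sketch_mount *mp = vp->v_mount;

	assert(mp != NULL);
	pthread_mutex_lock(&mp->mnt_lock);
	TAILQ_REMOVE(&mp->mnt_vnodes, vp, v_link);
	vp->v_mount = NULL;	/* cleared before the lock is dropped */
	pthread_mutex_unlock(&mp->mnt_lock);
}

/* Count the vnodes on the mount, checking the back pointer as we go. */
static int
sketch_mount_count(struct sketch_mount *mp)
{
	struct sketch_vnode *vp;
	int n = 0;

	pthread_mutex_lock(&mp->mnt_lock);
	TAILQ_FOREACH(vp, &mp->mnt_vnodes, v_link) {
		assert(vp->v_mount == mp);
		n++;
	}
	pthread_mutex_unlock(&mp->mnt_lock);
	return (n);
}
```

In the buggy ordering the assignment to `v_mount` came after the unlock, which left a window where list membership and the back pointer disagreed.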
# 130640	17-Jun-2004	phk	Second half of the dev_t cleanup. The big lines are: NODEV -> NULL NOUDEV -> NODEV udev_t -> dev_t udev2dev() -> findcdev() Various minor adjustments including handling of userland access to kernel space struct cdev etc.
# 130585	16-Jun-2004	phk	Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.
# 130470	14-Jun-2004	phk	Remove a left over from userland buffer-cache access to disks.
# 129898	31-May-2004	rwatson	Assert Giant in vrele().
# 128141	11-Apr-2004	mux	Put deprecated sysctl code inside BURN_BRIDGES.
# 127911	05-Apr-2004	imp	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 127593	29-Mar-2004	peter	Kill some XXXKSE's. vnlru/syncer are single threaded.
# 126853	11-Mar-2004	phk	Properly vector all bwrite() and BUF_WRITE() calls through the same path and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().
# 126851	11-Mar-2004	phk	Remove unused second arg to vfinddev(). Don't call addaliasu() on VBLK nodes.
# 126683	06-Mar-2004	kan	Always call vn_finished_write after vn_start_write was called. All occurrences of 'goto done' after vn_start_write invocation were cleaning up incompletely.
# 126326	27-Feb-2004	jhb	Switch the sleep/wakeup and condition variable implementations to use the sleep queue interface: - Sleep queues attempt to merge some of the benefits of both sleep queues and condition variables. Having sleep queues in a hash table avoids having to allocate a queue head for each wait channel. Thus, struct cv has shrunk down to just a single char * pointer now. However, the hash table does not hold threads directly, but queue heads. This means that once you have located a queue in the hash bucket, you no longer have to walk the rest of the hash chain looking for threads. Instead, you have a list of all the threads sleeping on that wait channel. - Outside of the sleepq code and the sleep/cv code the kernel no longer differentiates between cv's and sleep/wakeup. For example, calls to abortsleep() and cv_abort() are replaced with a call to sleepq_abort(). Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and cv_waitq_remove() have been replaced with calls to sleepq_remove(). - The sched_sleep() function no longer accepts a priority argument as sleeps no longer inherently bump the priority. Instead, this is solely a property of msleep() which explicitly calls sched_prio() before blocking. - The TDF_ONSLEEPQ flag has been dropped as it was never used. The associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been dropped and replaced with a single explicit clearing of td_wchan. TD_SET_ONSLEEPQ() would really have only made sense if it had taken the wait channel and message as arguments anyway. Now that that only happens in one place, a macro would be overkill.
# 126253	26-Feb-2004	truckman	Split the mlock() kernel code into two parts, mlock(), which unpacks the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms
# 126094	21-Feb-2004	phk	Check for NODEV return from udev2dev()
# 126082	21-Feb-2004	phk	Device megapatch 6/6: This is what we came here for: Hang dev_t's from their cdevsw, refcount cdevsw and dev_t and generally keep track of things a lot better than we used to: Hold a cdevsw reference around all entrances into the device driver, this will be necessary to safely determine when we can unload driver code. Hold a dev_t reference while the device is open. KASSERT that we do not enter the driver on a non-referenced dev_t. Remove old D_NAG code, anonymous dev_t's are not a problem now. When destroy_dev() is called on a referenced dev_t, move it to dead_cdevsw's list. When the refcount drops, free it. Check that cdevsw->d_version is correct. If not, set all methods to the dead_*() methods to prevent entrance into driver. Print warning on console to this effect. The device driver may still explode if it is also incompatible with newbus, but in that case we probably didn't get this far in the first place.
# 126081	21-Feb-2004	phk	Device megapatch 5/6: Remove the unused second argument from udev2dev(). Convert all remaining users of makedev() to use udev2dev(). The semantic difference is that udev2dev() will only locate a pre-existing dev_t, it will not like makedev() create a new one. Apart from the tiny well controlled window in D_PSEUDO drivers, there should no longer be any "anonymous" dev_t's in the system now, only dev_t's created with make_dev() and make_dev_alias()
# 124162	05-Jan-2004	kan	More style fixes. Obtained from: bde
# 124148	05-Jan-2004	kan	style(9): Add empty line before first code line in functions with no local variables. Properly terminate comment sentences. Indent lines which are longer than 80 characters. Move v_addpollinfo closer to the rest of poll-related functions. Move DEBUG_VFS_LOCKS ifdefed block to the end of file. Obtained from: bde (partly)
# 124118	04-Jan-2004	kan	Cosmetics: strip '\n' from a string passed to Debugger().
# 123932	28-Dec-2003	bde	v_vxproc was a bogus name for a thread (pointer).
# 123569	16-Dec-2003	jeff	- In vget() if LK_NOWAIT is specified we should return EBUSY and not ENOENT. Submitted by: Stephan Uphoff <ups@stups.com>
# 123568	16-Dec-2003	jeff	- When doing a forced unmount, VFS attempts to keep VCHR vnodes valid by reassigning their v_ops field to specfs, detaching from the mountpoint, etc. However, this is not sufficient. If we vclean() the vnode the pages owned by the vnode are lost, potentially while buffers reference them. Implement parts of vclean() separately in vgonechrl() so that the pages and bufs associated with a device vnode are not destroyed while in use.
# 123072	30-Nov-2003	jeff	- Don't forget to unlock the vnode interlock in the LK_NOWAIT case. Submitted by: Stephan Uphoff <ups@stups.com> Approved by: re (rwatson)
# 122352	09-Nov-2003	tanimura	- Implement selwakeuppri() which allows raising the priority of a thread being woken up. The thread woken up can run at a priority as high as after tsleep(). - Replace selwakeup()s with selwakeuppri()s and pass appropriate priorities. - Add cv_broadcastpri() which raises the priority of the broadcast threads. Used by selwakeuppri() if collision occurs. Not objected in: -arch, -current
# 122091	05-Nov-2003	kan	Remove mntvnode_mtx and replace it with per-mountpoint mutex. Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to operate on this mutex transparently. Eventually new mutex will be protecting more fields in struct mount, not only vnode list. Discussed with: jeff
# 121441	23-Oct-2003	wollman	Add appropriate const poisoning to the assert_*locked() family so that I can call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic. Inspired by: the evil and rude OpenAFS cache manager code
# 121287	20-Oct-2003	alc	Initialize the buf's b_object in pbgetvp(). Clear it in pbrelvp(). (This facilitates synchronization of the vm page's valid field using the vm object's lock.) Suggested by: tegge
# 121157	17-Oct-2003	phk	Simplify count_dev()
# 121038	12-Oct-2003	phk	Simplify vn_isdisk() a bit.
# 121012	11-Oct-2003	jeff	- Fix a typo, I meant & and not |. This was causing lockups from the syncer looping forever due to list corruption. Solved by: tegge
# 120791	05-Oct-2003	jeff	- Fix an XXX. Check the error of vn_lock() in vflush(). Don't specify LK_RETRY either, we don't want this vnode if it turns into another. - Remove the code that checks the mount point after acquiring the lock; we are guaranteed to either fail or get the vnode that we wanted.
# 120780	05-Oct-2003	jeff	- Rename vcanrecycle() to vtryrecycle() to reflect its new role. - In vtryrecycle() try to vgonel the vnode if all of the previous checks passed. We won't vgonel if someone has either acquired a hold or usecount or started the vgone process elsewhere. This is because we may have been removed from the free list while we were inspecting the vnode for recycling. - The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling the same vnode. To further reduce the likelihood of this event, requeue the vnode on the tail of the list prior to calling vtryrecycle(). We can not actually remove the vnode from the list until we know that it's going to be recycled because other interlock holders may see the VI_FREE flag and try to remove it from the free list. - Kill a bogus XXX comment. If XLOCK is set we shouldn't wait for it regardless of MNT_WAIT because the vnode does not actually belong to this filesystem.
# 120779	05-Oct-2003	jeff	- Don't cache_purge() in getnewvnode. It's done in vclean(). With this purge, the purge in vclean, and the filesystems purge, we had 3 purges per vnode. - Move the insmntque(vp, 0) to vclean() so that we may remove it from the two vgone() functions and reduce the number of lock operations required.
# 120773	05-Oct-2003	jeff	- Solve a LOR with the sync_mtx by using the VI_ONWORKLST flag to determine whether or not the sync failed. This could potentially get set between the time that we VOP_UNLOCK and VI_LOCK() but the race would harmlessly lead to the sync being delayed by an extra 30 seconds. If we do not move the vnode it could cause an endless loop if it continues to fail to sync. - Use vhold and vdrop to stop the vnode from changing identities while we have it unlocked. Other internal vfs lists are likely to follow this scheme.
# 120771	05-Oct-2003	jeff	- Move the xlock 'locking' code into vx_lock() and vx_unlock(). - Create a new function, vgonechrl(), which performs vgone for an in-use character device. Move the code from vflush() that did this into vgonechrl(). - Hold the xlock across the entirety of vgonel() and vgonechrl() so that at no point will an invalid vnode exist on any list without XLOCK set. - Move the xlock code out of vclean() now that it is in the vgone*() functions.
# 120756	04-Oct-2003	jeff	- In sched_sync() test our preconditions prior to dropping the sync_mtx. This is so that we may grab the interlock while still holding the sync_mtx. We have to VI_TRYLOCK() because in all other cases the lock order runs the other way. - If we don't meet any of the preconditions, reinsert the vp into the list for the next second. - We don't need to panic if we fail to sync here because each FSYNC function handles this case. Removing this redundant code also simplifies locking.
# 120746	04-Oct-2003	jeff	- In a Giantless world, the vn_lock() in vcanrecycle() could legitimately fail. Remove the panic from that case and document why it might fail. - Document the reason for calling cache_purge() on a newly created vnode. - In insmntque() order the operations so that we can call mtx_unlock() one fewer times. This makes the code somewhat clearer as well. - Add XXX comments in sched_sync() and vflush(). - In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is set. - In vclean() we don't need to acquire a lock around a single TAILQ_FIRST call. It's ok if we race here, the vinvalbuf will just do nothing. - Increase the scope of the lock in vgonel() to reduce the number of lock operations that are performed.
# 120271	20-Sep-2003	jeff	- In reassignbuf() don't unlock vp and lock newvp if they are the same. Doing so creates a race where the buf is on neither list. - Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we should. - Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this really should not happen and is fast to check.
# 120265	19-Sep-2003	jeff	- Remove spls(). The locking that has replaced them is in place and they no longer serve as guidelines for future work.
# 120241	19-Sep-2003	kan	Eliminate one case of VI_UNLOCK followed by an immediate VI_LOCK.
# 118607	07-Aug-2003	jhb	Consistently use the BSD u_int and u_short instead of the SYSV uint and ushort. In most of these files, there was a mixture of both styles and this change just makes them self-consistent. Requested by: bde (kern_ktrace.c)
# 117879	22-Jul-2003	phk	Revert stuff which accidentally ended up in the previous commit.
# 117878	22-Jul-2003	phk	Don't attempt to inline large functions mb_alloc() and mb_free(), it more than doubles the text size of this file. GCC has wisely ignored us on this previously
# 116182	11-Jun-2003	obrien	Use __FBSDID().
# 115536	31-May-2003	phk	Remove unused variable and now unbalanced call to splbio(); Found by: FlexeLint
# 115266	23-May-2003	alc	Make the maximum number of vnodes a function of both the physical memory size and the kernel's heap size, specifically, vm_kmem_size. This function allows a maximum of 40% of the vm_kmem_size to be used for vnodes and vm objects. This is a conservative bound based upon recent problem reports. (In other words, a slight increase in this percentage may be safe.) Finally, machines with less than ~3GB of RAM should be unaffected by this change, i.e., the maximum number of vnodes should remain the same. If necessary, machines with 3GB or more of RAM can increase the maximum number of vnodes by increasing vm_kmem_size. Desired by: scottl Tested by: jake Approved by: re (rwatson,scottl)
# 115076	16-May-2003	truckman	Detect that a vnode has been reclaimed while vflush() was waiting to lock the vnode and restart the loop. Vflush() is vulnerable since it does not hold a reference to the vnode and it holds no other locks while waiting for the vnode lock. The vnode will no longer be on the list when the loop is restarted. Approved by: re (rwatson)
# 114968	13-May-2003	alc	Optimize the use of splay in gbincore(). During a "make buildworld" the desired buffer is found at one of the roots more than 60% of the time. Thus, checking both roots before performing either splay eliminates unnecessary splays on the first tree splayed. Approved by: re (jhb)
# 114945	12-May-2003	rwatson	Remove bogus locking from DDB's "show lockedvnods" command: using synchronization primitives from inside DDB is generally a bad idea, and in this case it frequently results in panics due to DDB commands being executed from the sio fast interrupt context on a serial console. Replace the locking with a note that a lack of locking means that DDB may see inconsistent views of the mount and vnode lists, which could also result in a panic. More frequently, though, this avoids a panic than causes it. Discussed with ages ago: bde Approved by: re (scottl)
# 114570	03-May-2003	alc	- Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't trustworthy for vnode-backed objects. - Restore the old behavior of vm_object_page_remove() when the end of the given range is zero. Add a comment to vm_object_page_remove() regarding this behavior. Reported by: iedowse
# 114371	01-May-2003	alc	Lock accesses to the vm_object's ref_count and resident_page_count.
# 114091	26-Apr-2003	alc	Various changes to vm_object_page_remove(): - Eliminate an odd, special-case feature: if start == end == 0 then all pages are removed. Only one caller used this feature and that caller can trivially pass the object's size. - Assert that the vm_object is locked on entry; don't bother testing for a NULL vm_object. - Style: Fix lines that are longer than 80 characters.
# 114074	26-Apr-2003	alc	- Convert vm_object_pip_wait() from using tsleep() to msleep(). - Make vm_object_pip_sleep() static. - Lock the vm_object when performing vm_object_pip_wait().
# 113955	24-Apr-2003	alc	- Acquire the vm_object's lock when performing vm_object_page_clean(). - Add a parameter to vm_pageout_flush() that tells vm_pageout_flush() whether its caller has locked the vm_object. (This is a temporary measure to bootstrap vm_object locking.)
# 113671	18-Apr-2003	alc	Update locking around vm_object_page_remove() to use the new macros.
# 113426	13-Apr-2003	alc	Use vm_object_pip_wait() rather than reimplementing it.
# 112693	26-Mar-2003	tegge	Adjust the number of vnodes scanned by vlrureclaim() according to the size of the vnode list.
# 112495	22-Mar-2003	yar	We shouldn't assert that a vnode is locked in vop_lock_post() if VOP_LOCK() has failed. Reviewed by: jeff
# 112182	13-Mar-2003	jeff	- Remove a dead check for bp->b_vp == vp in vtruncbuf(). This has not been possible for some time. - Lock the buf before accessing fields. This should very rarely be locked. - Assert that B_DELWRI is set after we acquire the buf. This should always be the case now.
# 112181	13-Mar-2003	jeff	- Remove a race between fsync like functions and flushbufqueues() by requiring locked bufs in vfs_bio_awrite(). Previously the buf could have been written out by fsync before we acquired the buf lock if it weren't for giant. The cluster_wbuild() handles this race properly but the single write at the end of vfs_bio_awrite() would not. - Modify flushbufqueues() so there is only one copy of the loop. Pass a parameter in that says whether or not we should sync bufs with deps. - Call flushbufqueues() a second time and then break if we couldn't find any bufs without deps.
# 111937	06-Mar-2003	alc	Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress. Discussed on: arch@
# 111841	03-Mar-2003	njl	Finish cleanup of vprint() which was begun with changing v_tag to a string. Remove extraneous uses of vop_null, instead deferring to the default op. Rename vnode type "vfs" to the more descriptive "syncer". Fix formatting for various filesystems that use vop_print.
# 111723	02-Mar-2003	jeff	- Hold the vnode interlock across calls to bgetvp instead of acquiring it internally. This is required to stop multiple bufs from being associated with a single lblkno.
# 111694	01-Mar-2003	jeff	- gc USE_BUFHASH. The smp locking of the buf cache renders this useless.
# 111466	25-Feb-2003	mckusick	Prevent large files from monopolizing the system buffers. Keep track of the number of dirty buffers held by a vnode. When a bdwrite is done on a buffer, check the existing number of dirty buffers associated with its vnode. If the number rises above vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one of the other (hopefully older) dirty buffers associated with the vnode is written (using bawrite). In the event that this approach fails to curb the growth in the vnode's number of dirty buffers (due to soft updates rollback dependencies), the more drastic approach of doing a VOP_FSYNC on the vnode is used. This code primarily affects very large and actively written files such as snapshots. This change should eliminate hanging when taking snapshots or doing background fsck on very large filesystems. Hopefully, one day it will be possible to cache filesystem metadata in the VM cache as is done with file data. As it stands, only the buffer cache can be used which limits total metadata storage to about 20Mb no matter how much memory is available on the system. This rather small memory gets badly thrashed causing a lot of extra I/O. For example, taking a snapshot of a 1Tb filesystem minimally requires about 35,000 write operations, but because of the cache thrashing (we only have about 350 buffers at our disposal) ends up doing about 237,540 I/O's thus taking twenty-five minutes instead of four if it could run entirely in the cache. Reported by: Attila Nagy <bra@fsn.hu> Sponsored by: DARPA & NAI Labs.
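The policy described in r111466 can be sketched roughly as follows. This is an illustrative model only, not the kernel code: the `ex_` names, the enum, and the `slack` parameter (how far past the threshold the count must climb before falling back to a full fsync) are assumptions made for the example. Only the 90% ratio and the two-step escalation come from the commit message.

```c
#include <assert.h>

/* Hypothetical outcome of the check performed when a buffer is delayed-written. */
enum ex_action {
	EX_NONE,	/* below threshold: nothing extra to do */
	EX_WRITE_ONE,	/* write one other dirty buffer (bawrite in the log) */
	EX_FSYNC	/* drastic fallback: VOP_FSYNC on the vnode */
};

/* The commit states the threshold is currently 90% of vfs.hidirtybuffers. */
static int
ex_dirtybufthresh(int hidirtybuffers)
{
	assert(hidirtybuffers >= 0);
	return (hidirtybuffers * 9 / 10);
}

/*
 * Decide what to do given the vnode's dirty buffer count.  The slack
 * value is an assumption of this sketch: it models "writing a single
 * buffer failed to curb the growth".
 */
static enum ex_action
ex_bdwrite_policy(int vnode_dirty, int thresh, int slack)
{
	if (vnode_dirty > thresh + slack)
		return (EX_FSYNC);
	if (vnode_dirty > thresh)
		return (EX_WRITE_ONE);
	return (EX_NONE);
}
```

With a high-water mark of 1000 buffers, the sketch yields a threshold of 900; a vnode holding 905 dirty buffers would trigger a single write, and one holding more than 910 (with a slack of 10) would trigger the fsync fallback.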
# 111463	25-Feb-2003	jeff	- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK. - Remove the buftimelock mutex and acquire the buf's interlock to protect these fields instead. - Hold the vnode interlock while locking bufs on the clean/dirty queues. This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another BUF_LOCK with a LK_TIMEFAIL to a single lock. Reviewed by: arch, mckusick
# 111333	23-Feb-2003	phk	Bracket the kern.vnode sysctl in #ifdef notyet because it results in massive locking issues on diskless systems. It is also not clear that this sysctl is non-dangerous in its requirements for locked down memory on large RAM systems.
# 111119	19-Feb-2003	imp	Back out M_* changes, per decision of the TRB. Approved by: trb
# 109623	21-Jan-2003	alfred	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 108399	29-Dec-2002	iedowse	Add a new vnode flag VI_DOINGINACT to indicate that a VOP_INACTIVE call is in progress on the vnode. When vput() or vrele() sees a 1->0 reference count transition, it now returns without any further action if this flag is set. This flag is necessary to avoid recursion into VOP_INACTIVE if the filesystem inactive routine causes the reference count to increase and then drop back to zero. It is also used to guarantee that an unlocked vnode will not be recycled while blocked in VOP_INACTIVE(). There are at least two cases where the recursion can occur: one is that the softupdates code called by ufs_inactive() via ffs_truncate() can call vput() on the vnode. This has been reported by many people as "lockmgr: draining against myself" panics. The other case is that nfs_inactive() can call vget() and then vrele() on the vnode to clean up a sillyrename file. Reviewed by: mckusick (an older version of the patch)
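The recursion guard described in r108399 can be modelled in a few lines. This is a simplified user-space sketch, not the kernel implementation: `struct ex_vnode`, the flag value, and the `nested_refs` field (which stands in for a filesystem inactive routine that takes and drops references) are all hypothetical, and no locking is shown.

```c
#include <assert.h>

#define EX_VI_DOINGINACT 0x01	/* hypothetical bit value for the sketch */

struct ex_vnode {
	int usecount;		/* reference count */
	int iflag;		/* interlock-protected flags (no lock modelled) */
	int inactive_calls;	/* how many times the inactive hook ran */
	int nested_refs;	/* ref/release pairs the hook will perform */
};

static void ex_vrele(struct ex_vnode *vp);

/* Stand-in for VOP_INACTIVE: it may re-reference and release the vnode. */
static void
ex_inactive(struct ex_vnode *vp)
{
	vp->inactive_calls++;
	while (vp->nested_refs > 0) {
		vp->nested_refs--;
		vp->usecount++;
		ex_vrele(vp);	/* another 1->0 transition, now guarded */
	}
}

/* Drop a reference; on 1->0 run the inactive hook unless already running. */
static void
ex_vrele(struct ex_vnode *vp)
{
	assert(vp->usecount > 0);
	if (--vp->usecount > 0)
		return;
	if (vp->iflag & EX_VI_DOINGINACT)
		return;		/* inactive already in progress: do nothing */
	vp->iflag |= EX_VI_DOINGINACT;
	ex_inactive(vp);
	vp->iflag &= ~EX_VI_DOINGINACT;
}
```

Without the flag check, each nested release inside `ex_inactive()` would re-enter the hook; with it, the hook runs exactly once per outermost 1->0 transition.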
# 108390	29-Dec-2002	phk	Use a timeout of one second while we wait for the vnode washer, this prevents a potential race and makes the system a little bit less jerky under extreme loads.
# 108388	29-Dec-2002	phk	Vnodes pull in 800-900 bytes these days, all things counted, so we need to treat desiredvnodes much more like a limit than as a vague concept. On a 2GB RAM machine where desired vnodes is 130k, we run out of kmem_map space when we hit about 190k vnodes. If we wake up the vnode washer in getnewvnode(), sleep until it is done, so that it has a chance to offer us a washed vnode. If we don't sleep here we'll just race ahead and allocate yet another vnode which will never get freed. In the vnodewasher, instead of doing 10 vnodes per mountpoint per rotation, do 10% of the vnodes distributed evenly across the mountpoints.
# 108367	28-Dec-2002	phk	KASSERT that vop_revoke() gets a VCHR.
# 107891	15-Dec-2002	alc	Perform vm_object_lock() and vm_object_unlock() around vm_object_page_remove().
# 107676	08-Dec-2002	alc	To avoid lock order reversals in getnewvnode(), the call to uma_zfree() must be delayed until the vnode interlock is released. Reported by: kris@ Approved by: re (jhb)
# 107319	27-Nov-2002	robert	Do not set a variable (vp->p_pollinfo) to NULL if we know it already has that value. Approved by: re
# 105988	26-Oct-2002	rwatson	Slightly change the semantics of vnode labels for MAC: rather than "refreshing" the label on the vnode before use, just get the label right from inception. For single-label file systems, set the label in the generic VFS getnewvnode() code; for multi-label file systems, leave the labeling up to the file system. With UFS1/2, this means reading the extended attribute during vfs_vget() as the inode is pulled off disk, rather than hitting the extended attributes frequently during operations later, improving performance. This also corrects semantics for shared vnode locks, which were not previously present in the system. This changes the cache coherency properties WRT out-of-band access to label data, but in an acceptable form. With UFS1, there is a small race condition during automatic extended attribute start -- this is not present with UFS2, and occurs because EAs aren't available at vnode inception. We'll introduce a work around for this shortly. Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 105914	25-Oct-2002	phk	In vrele() we can actually have a VCHR with v_rdev == NULL if we came from the bottom of addaliasu(). Don't panic.
# 105902	25-Oct-2002	mckusick	Within ufs, the ffs_sync and ffs_fsync functions did not always check for and/or report I/O errors. The result is that a VFS_SYNC or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in the presence of a hard error writing a disk sector or in a filesystem full condition. This patch ensures that I/O errors will always be checked and returned. This patch also ensures that every call to VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes appropriate action when an error is returned. Sponsored by: DARPA & NAI Labs.
# 105894	24-Oct-2002	phk	Fix the spechash lock order reversal by keeping an updated sum of v_usecount in the dev_t which vcount() can return without locking any vnodes. Seen by: jhb
# 105121	14-Oct-2002	mckusick	When scanning the freelist looking for candidate vnodes to recycle, be sure to exit the loop with vp == NULL if no candidates are found. Formerly, this bug would cause the last vnode inspected to be used, even if it was not available. The result was a panic "vn_finished_write: neg cnt". Sponsored by: DARPA & NAI Labs.
# 105119	14-Oct-2002	mckusick	Unconditionally reset vp->v_vnlock back to the default in the vclean() function (e.g., vp->v_vnlock = &vp->v_lock) rather than requiring filesystems that use alternate locks to do so in their vop_reclaim functions. This change is a further cleanup of the vop_stdlock interface. Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk> Sponsored by: DARPA & NAI Labs.
# 105077	14-Oct-2002	mckusick	Regularize the vop_stdlock'ing protocol across all the filesystems that use it. Specifically, vop_stdlock uses the lock pointed to by vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to reference vp->v_lock. Filesystems that wish to use the default do not need to allocate a lock at the front of their node structure (as some still did) or do a lockinit. They can simply start using vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks, but still use the vop_stdlock functions (such as nullfs) can simply replace vp->v_vnlock with a pointer to the lock that they wish to have used for the vnode. Such filesystems are responsible for setting the vp->v_vnlock back to the default in their vop_reclaim routine (e.g., vp->v_vnlock = &vp->v_lock). In theory, this set of changes cleans up the existing filesystem lock interface and should have no function change to the existing locking scheme. Sponsored by: DARPA & NAI Labs.
# 104829	11-Oct-2002	mckusick	When considering a vnode for reuse in getnewvnode, we call vcanrecycle to check a free vnode's availability. If it is available, vcanrecycle returns an error code of zero and the vnode in question locked. The getnewvnode routine then used to call vn_start_write with the V_NOWAIT flag. If the filesystem was suspended while taking a snapshot, the vn_start_write would fail but getnewvnode would fail to unlock the vnode, instead leaving it locked on the freelist. The result would be that the vnode would be locked forever and would eventually hang the system with a race to the root when it was attempted to recycle it. This fix moves the vn_start_write check into vcanrecycle where it will properly unlock the vnode if it is unavailable for recycling due to filesystem suspension. Sponsored by: DARPA & NAI Labs.
# 104509	05-Oct-2002	sobomax	Fix problem introduced in rev.1.406, which can cause already unlocked mutex being unlocked again causing system panic.
# 104302	01-Oct-2002	phk	Fix some harmless mis-indents. Spotted by: FlexeLint
# 104237	30-Sep-2002	rwatson	Move vnode MAC label initialization to after the release of the vnode interlock in getnewvnode() to avoid possible sleeps while holding the mutex. Note that the warning from Witness is a slight false positive since we know there will be no contention on the interlock since we haven't made the vnode available for use yet, but the theory is not a bad one. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 104094	28-Sep-2002	phk	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512
# 103986	26-Sep-2002	jeff	- Move ASSERT_VOP_LOCK functionality into functions in vfs_subr.c - Make the VI asserts more orthogonal to the rest of the asserts by using a new, common vfs_badlock() function and adding a 'str' arg. - Adjust generated ASSERTS to match the new prototype. - Adjust explicit ASSERTS to match the new prototype.
# 103933	25-Sep-2002	jeff	- Lock down the syncer with sync_mtx. - Enable vfs_badlock_mutex by default. - Assert that the vp is locked in VOP_UNLOCK. - Use standard interlock macros in remaining code. - Correct a race in getnewvnode(). - Lock access to v_numoutput with interlock. - Lock access to buf lists and splay tree with interlock. - Add VOP and VI asserts. - Lock b_vnbufs with the vnode interlock. - Add vrefcnt() for callers who want to retrieve the vnode ref without holding a lock. Add a comment that describes when this is safe. - Add vholdl() and vdropl() so that callers who already own the interlock can avoid race conditions and unnecessary unlocking. - Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case. - Hold the interlock before dropping the mntlist_mtx in vflush() to avoid a race. - Fix locking in vfs_msync().
# 103559	18-Sep-2002	njl	Remove any VOP_PRINT that redundantly prints the tag. Move lockmgr_printinfo() into vprint() for everyone's benefit. Suggested by: bde
# 103314	14-Sep-2002	njl	Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging. Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP. Suggested by: phk Reviewed by: bde, rwatson (earlier version)
# 103228	11-Sep-2002	julian	Indentation does not make a block.. need curly braces too. Submitted by: Eagle-eyes evans <bde@freebsd.org>
# 103216	11-Sep-2002	julian	Completely redo thread states. Reviewed by: davidxu@freebsd.org
# 102989	05-Sep-2002	phk	Fix an inherited style bug: compare with NOCRED instead of NULL. Sponsored by: DARPA & NAI Labs.
# 102987	05-Sep-2002	phk	Introduce new extattr_check_cred() function which implements the canonical credential washing for extended attributes. Sponsored by: DARPA & NAI Labs.
# 102412	25-Aug-2002	charnier	Replace various spelling with FALLTHROUGH which is lint()able
# 102297	23-Aug-2002	jeff	- Fix a mistake in my last few commits. The PDROP flag stops msleep from re-acquiring the mutex. Pointy hat to: me Noticed by: tegge
# 102255	22-Aug-2002	jeff	- Make vn_lock() vget() and VOP_LOCK() all behave the same way WRT LK_INTERLOCK. The interlock will never be held on return from these functions even when there is an error. Errors typically only occur when the XLOCK is held which means this isn't the vnode we want anyway. Almost all users of these interfaces expected this behavior even though it was not provided before.
# 102251	22-Aug-2002	jeff	- Fix interlock handling in vn_lock(). Previously, vn_lock() could return with interlock held in error conditions when the caller did not specify LK_INTERLOCK. - Add several comments to vn_lock() describing the rationale behind the code flow since it was not immediately obvious.
# 102214	21-Aug-2002	jeff	- Document two cases, one in vget and the other in vn_lock, where the state of interlock on exit is not consistent. There are probably several bugs relating to this.
# 102211	21-Aug-2002	jeff	- If vn_lock fails with the LK_INTERLOCK flag set, interlock will not be released. vcanrecycle() failed to unlock interlock under this condition. - Remove an extra VOP_UNLOCK from a failure case in vcanrecycle(). Pointed out by: rwatson
# 102210	21-Aug-2002	jeff	- Add two new debugging macros: ASSERT_VI_LOCKED and ASSERT_VI_UNLOCKED - Use the new VI asserts in place of the old mtx_assert checks. - Add the VI asserts to the automated lock checking in the VOP calls. The interlock should not be held across vops with a few exceptions. - Add the vop_(un)lock_{pre,post} functions to assert that interlock is held when LK_INTERLOCK is set.
# 101769	13-Aug-2002	jeff	- Extend the vnode_free_list_mtx to cover numvnodes and freevnodes. This was done only some of the time before, and now it is uniformly applied.
# 101651	10-Aug-2002	mux	- Introduce a new struct xvfsconf, the userland version of struct vfsconf. - Make getvfsbyname() take a struct xvfsconf *. - Convert several consumers of getvfsbyname() to use struct xvfsconf. - Correct the getvfsbyname.3 manpage. - Create a new vfs.conflist sysctl to dump all the struct xvfsconf in the kernel, and rewrite getvfsbyname() to use this instead of the weird existing API. - Convert some {set,get,end}vfsent() consumers to use the new vfs.conflist sysctl. - Convert a vfsload() call in nfsiod.c to kldload() and remove the useless vfsisloadable() and endvfsent() calls. - Add a warning printf() in vfs_sysctl() to tell people they are using an old userland. After these changes, it's possible to modify struct vfsconf without breaking the binary compatibility. Please note that these changes don't break this compatibility either. When bp will have updated mount_smbfs(8) with the patch I sent him, there will be no more consumers of the {set,get,end}vfsent(), vfsisloadable() and vfsload() API, and I will promptly delete it.
# 101367	05-Aug-2002	jeff	- Move some logic from getnewvnode() to a new function vcanrecycle() - Unlock the free list mutex around vcanrecycle to prevent a lock order reversal.
# 101308	04-Aug-2002	jeff	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
# 101173	01-Aug-2002	rwatson	Include file cleanup; mac.h and malloc.h at one point had ordering relationship requirements, and no longer do. Reminded by: bde
# 101041	31-Jul-2002	des	Nit in previous commit: the correct sysctl type is "S,xvnode"
# 101040	31-Jul-2002	des	Initialize v_cachedid to -1 in getnewvnode(). Reintroduce the kern.vnode sysctl and make it export xvnodes rather than vnodes. Sponsored by: DARPA, NAI Labs
# 101009	31-Jul-2002	rwatson	Note that the privilege indicating flag to vaccess() originally used by the process accounting system is now deprecated.
# 101008	31-Jul-2002	rwatson	Introduce support for Mandatory Access Control and extensible kernel access control. Invoke the necessary MAC entry points to maintain labels on vnodes. In particular, initialize the label when the vnode is allocated or reused, and destroy the label when the vnode is going to be released, or reused. Wow, an object where there really is exactly one place where it's allocated, and one other where it's freed. Amazing. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 100863	29-Jul-2002	jeff	- Backout the patch made in revision 1.75 of vfs_mount.c. The vputs here were hiding the real problem of the missing unlock in sync_inactive. - Add the missing unlock in sync_inactive. Submitted by: iedowse
# 100831	28-Jul-2002	truckman	Wire the sysctl output buffer before grabbing any locks to prevent SYSCTL_OUT() from blocking while locks are held. This should only be done when it would be inconvenient to make a temporary copy of the data and defer calling SYSCTL_OUT() until after the locks are released.
# 100481	22-Jul-2002	rwatson	Teach discretionary access control methods for files about VAPPEND and VALLPERM. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 100344	19-Jul-2002	mckusick	Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word are reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags |= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, let's call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended attribute storage. Sponsored by: DARPA & NAI Labs.
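The IO_NORMAL/IO_EXT rules described in r100344 can be illustrated with a small userland sketch. The flag values and the `ex_`/`EX_` names below are hypothetical and chosen only for illustration; the real constants and checks live in the kernel sources.

```c
#include <assert.h>

/* Hypothetical flag values for illustration only; not the kernel's. */
#define EX_IO_EXT	0x0400
#define EX_IO_NORMAL	0x0800

/*
 * Backward-compatibility rule from the commit message: if neither
 * flag is set, normal data I/O is assumed.
 */
int
ex_io_default(int ioflags)
{
	if ((ioflags & (EX_IO_NORMAL | EX_IO_EXT)) == 0)
		ioflags |= EX_IO_NORMAL;
	/* After defaulting, at least one of the two flags is set. */
	assert((ioflags & (EX_IO_NORMAL | EX_IO_EXT)) != 0);
	return (ioflags);
}

/*
 * For read and write the two flags are mutually exclusive; only
 * truncate may pass both. Returns nonzero if the word is acceptable
 * for read/write.
 */
int
ex_io_valid_rw(int ioflags)
{
	int both = EX_IO_NORMAL | EX_IO_EXT;

	return ((ex_io_default(ioflags) & both) != both);
}
```

Under these assumptions, a caller that passes no flags at all is treated exactly like one that passes the normal-data flag, which is what keeps existing callers working.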
# 100207	17-Jul-2002	mckusick	Change utimes to set the file creation time (for filesystems that support creation times such as UFS2) to the value of the modification time if the value of the modification time is older than the current creation time. See utimes(2) for further details. Sponsored by: DARPA & NAI Labs.
# 99737	10-Jul-2002	dillon	Replace the global buffer hash table with per-vnode splay trees using a methodology similar to the vm_map_entry splay and the VM splay that Alan Cox is working on. Extensive testing has appeared to have shown no increase in overhead. Disadvantages Dirties more cache lines during lookups. Not as fast as a hash table lookup (but still N log N and optimal when there is locality of reference). Advantages vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem syncer operate more efficiently. I get to rip out all the old hacks (some of which were mine) that tried to keep the v_dirtyblkhd tailq sorted. The per-vnode splay tree should be easier to lock / SMPng pushdown on vnodes will be easier. This commit along with another that Alan is working on for the VM page global hash table will allow me to implement ranged fsync(), optimize server-side nfs commit rpcs, and implement partial syncs by the filesystem syncer (aka filesystem syncer would detect that someone is trying to get the vnode lock, remembers its place, and skip to the next vnode). Note that the buffer cache splay is somewhat more complex than other splays due to special handling of background bitmap writes (multiple buffers with the same lblkno in the same vnode), and B_INVAL discontinuities between the old hash table and the existence of the buffer on the v_cleanblkhd list. Suggested by: alc
# 99690	09-Jul-2002	jeff	- Use standard locking functions in syncer's opv - vput instead of vrele syncer vnodes in vfs_mount - Add vop_lookup_{pre,post} to verify locking in VOP_LOOKUP
# 99515	07-Jul-2002	jeff	- Don't hold the vn lock while calling VOP_CLOSE in vclean().
# 99513	07-Jul-2002	jeff	- BUF_REFCNT() seems to be the preferred method for verifying a locked buf. Tell vop_strategy_pre() to use this instead. - Ignore B_CLUSTER bufs. Their components are locked but they don't really exist so they don't have to be. This isn't ideal but it is safe.
# 99508	06-Jul-2002	jeff	Fix a mistake in my last commit. Don't grab an extra reference to the object in bp->b_object.
# 99489	06-Jul-2002	jeff	Fixup uses of GETVOBJECT. - Cache a pointer to the vnode's object in the buf. - Hold a reference to that object in addition to the vnode's reference just to be consistent. - Cleanup code that got the object indirectly through the vp and VOP calls. This fixes at least one case where we were calling GETVOBJECT without a lock. It also avoids an expensive layered call at the cost of another pointer in struct buf.
# 99485	06-Jul-2002	jeff	- Add vop_strategy_pre to validate VOP_STRATEGY locking. - Disable original vop_strategy lock specification. - Switch to the new vop_strategy_pre for lock validation. VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need to be translated. There may be other reasons, but as long as the underlying layer uses a VOP to perform the operations they will be caught later.
# 99483	06-Jul-2002	jeff	Add "vop_rename_pre" to do pre rename lock verification. This is enabled only with DEBUG_VFS_LOCKS.
# 99338	03-Jul-2002	mux	Move vfs_rootmountalloc() in vfs_mount.c and remove lite2_vfs_mountroot() which was #if 0'd and is not likely to be used now.
# 99264	02-Jul-2002	mux	Move all code related to mount(2) into a new file, vfs_mount.c. The file vfs_conf.c which was dealing with root mounting has been repo-copied into vfs_mount.c to preserve history. This makes nmount related development easier, and helps reduce the size of vfs_syscalls.c, which is still an enormous file. Reviewed by: rwatson Repo-copy by: peter
# 99220	01-Jul-2002	iedowse	Use indirect function pointer hooks instead of #ifdef SOFTUPDATES direct calls for the two places where the kernel calls into soft updates code. Set up the hooks in softdep_initialize() and NULL them out in softdep_uninitialize(). This change allows soft updates to function correctly when ufs is loaded as a module. Reviewed by: mckusick
# 99021	29-Jun-2002	obrien	Rename the db command lockedvnodes to lockedvnods so that it fits on the help screen and one doesn't think we have a lockedvnodesmap command.
# 98994	28-Jun-2002	alfred	nuke caddr_t.
# 98985	28-Jun-2002	jeff	Improve the VOP locking asserts - Add vfs_badlock_print to control whether or not we print lock violations - Add vfs_badlock_panic to control whether we panic on lock violations Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.
# 98979	28-Jun-2002	green	Fix a case where a vnode got explicitly unlocked after the pointer to it got set to NULL. Revision 1.355: in the box
# 98510	20-Jun-2002	mux	Change the way we internally store the mount options to a linked list. This is to allow the merging of the mount options in the MNT_UPDATE case, as the current data structure is unsuitable for this. There are no functional differences in this commit. Reviewed by: phk
# 98233	14-Jun-2002	mux	Change vfs_copyopt() so that the length argument passed to it must be the exact same size as the mount option. This makes vfs_copyopt() much more useful.
# 97935	06-Jun-2002	des	Move some sysctls from the debug tree to the vfs tree.
# 97934	06-Jun-2002	des	Gratuitous whitespace cleanup.
# 96755	16-May-2002	trhodes	More s/file system/filesystem/g
# 96744	16-May-2002	mux	o Fix vfs_copyopt(), the first argument to bcopy() is the source, not the destination. o Remove some code from vfs_getopt() which was making the interface more complicated to use for a very slight gain.
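The argument-order bug fixed in r96744 is easy to reproduce in userland. A minimal sketch, assuming a hypothetical helper `copy_opt()` (this is not the kernel's vfs_copyopt()), shows that bcopy() takes the source first, the reverse of memcpy():

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/*
 * Hypothetical helper, not the kernel's vfs_copyopt(): copy len bytes
 * of an option value into a caller-supplied buffer.
 */
void
copy_opt(const void *value, void *dest, size_t len)
{
	assert(value != NULL && dest != NULL);
	/* bcopy(src, dst, len): the source comes first. */
	bcopy(value, dest, len);
}
```

Swapping the first two arguments would silently overwrite the option value with whatever the destination buffer happened to contain.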
# 96145	07-May-2002	jeff	Switch from just holding the interlock to holding the standard lock throughout getnewvnode(). This is safer. In the future, we should investigate requiring only the interlock to get the vnode object.
# 96094	06-May-2002	jeff	Hold the currently selected vnode's lock across the call to VOP_GETVOBJECT. Don't try to create a vm object before the file system has a chance to finish initializing it. This is incorrect for a number of reasons. Firstly, that VOP requires a lock which the file system may not have initialized yet. Also, open and others will create a vm object if it is necessary later.
# 96073	05-May-2002	phk	Expand the one-line function pbreassignbuf() in the only place it is or could be used.
# 96034	04-May-2002	dillon	Remove obsolete code (that was already #if 0'd out). Requested by: Hiten Pandya <hitmaster2k@yahoo.com>
# 93818	04-Apr-2002	jhb	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
# 93593	01-Apr-2002	jhb	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
# 93228	26-Mar-2002	mux	As discussed in -arch, add the new nmount(2) system call and the new vfs_getopt()/vfs_copyopt() API. This is intended to be used later, when there will be filesystems implementing the VFS_NMOUNT operation. The mount(2) system call will disappear when all filesystems will be converted to the new API. Documentation will be committed in a while. Reviewed by: phk
# 92751	20-Mar-2002	jeff	Remove references to vm_zone.h and switch over to the new uma API. Also, remove maxsockets. If you look carefully you'll notice that the old zone allocator never honored this anyway.
# 92723	19-Mar-2002	alfred	Remove __P.
# 91709	05-Mar-2002	rwatson	Three p_ucred -> td_ucred's missed in jhb's earlier pass; all appear to be safe.
# 91406	27-Feb-2002	jhb	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
# 90860	18-Feb-2002	phk	Make v_addpollinfo() visible and non-inline. Have callers only call it as needed. Add necessary call in ufs_kqfilter(). Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>
# 90835	18-Feb-2002	phk	Remove yet another redundant VN_KNOTE() macro.
# 90791	17-Feb-2002	phk	Move the stuff related to select and poll out of struct vnode. The use of the zone allocator may or may not be overkill. There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need to revisit. This shaves about 60 bytes of struct vnode which on my laptop means 600k less RAM used for vnodes.
# 90375	07-Feb-2002	peter	Fix a couple of style bugs introduced (or touched by) previous commit.
# 90361	07-Feb-2002	julian	Pre-KSE/M3 commit. this is a low-functionality change that changes the kernel to access the main thread of a process via the linked list of threads rather than assuming that it is embedded in the process. It IS still embedded there but remove all the code that assumes that in preparation for the next commit which will actually move it out. Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
# 90100	02-Feb-2002	mckusick	In the routines vrele() and vput(), we must lock the vnode and call VOP_INACTIVE before placing the vnode back on the free list. Otherwise there is a race condition on SMP machines between getnewvnode() locking the vnode to reclaim it and vrele() locking the vnode to inactivate it. This window of vulnerability becomes exaggerated in the presence of filesystems that have been suspended as the inactive routine may need to temporarily release the lock on the vnode to avoid deadlock with the syncer process.
# 89526	19-Jan-2002	dillon	Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal operation. The vgonel() code has always called vclean() but until we started proactively freeing vnodes it would never actually be called with a dirty vnode, so this situation did not occur prior to the vnlru() code. Now that we proactively free vnodes when kern.maxvnodes is hit, however, vclean() winds up with work to do and improperly generates the warnings. Reviewed by: peter Approved by: re (for MFC) MFC after: 1 day
# 89384	15-Jan-2002	mckusick	When downgrading a filesystem from read-write to read-only, operations involving file removal or file update were not always being fully committed to disk. The result was lost files or corrupted file data. This change ensures that the filesystem is properly synced to disk before the filesystem is down-graded. This delta also fixes a long standing bug in which a file open for reading has been unlinked. When the last open reference to the file is closed, the inode is reclaimed by the filesystem. Previously, if the filesystem had been down-graded to read-only, the inode could not be reclaimed, and thus was lost and had to be later recovered by fsck. With this change, such files are found at the time of the down-grade. Normally they will result in the filesystem down-grade failing with `device busy'. If a forcible down-grade is done, then the affected files will be revoked causing the inode to be released and the open file descriptors to begin failing on attempts to read. Submitted by: "Sam Leffler" <sam@errno.com>
# 89238	10-Jan-2002	dillon	Add vlruvp() routine - implements LRU operation for vnode recycling. We calculate a trigger point that both guarantees we will find a sufficient number of vnodes to recycle and prevents us from recycling vnodes with lots of resident pages. This particular section of code is designed to recycle vnodes, not do unnecessary frees of cached VM pages.
# 88465	25-Dec-2001	dillon	Fix type-o in previous commit (tsleep was using wrong rendezvous point)
# 88318	20-Dec-2001	dillon	Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout. Hopefully MFC: before the 4.5 release
# 88161	19-Dec-2001	peter	Do not initialize static/global variables to 0. Use bss instead of taking up space in the data section.
# 88160	19-Dec-2001	peter	Use a different mechanism to get the vnlru process to wake up and notice the shutdown request at reboot/halt time. Disable the printf 'vnlru process getting nowhere, pausing...' and instead export the count to the debug.vnlru_nowhere sysctl.
# 88149	18-Dec-2001	dillon	This is a forward port of Peter's vlrureclaim() fix, with some minor mods by me to make it more efficient. The original code had serious balancing problems and could also deadlock easily. This code relegates the vnode reclamation to its own kproc and relaxes the vnode reclamation requirements to better maintain kern.maxvnodes. This code still doesn't balance as well as it could, but it does a much better job than the original code. Approved by: re@freebsd.org Obtained from: ps, peter, dillon MFS Assuming: Assuming no problems crop up in Yahoo testing MFC after: 7 days
# 87847	14-Dec-2001	dillon	A slightly different version of the vlrureclaim fix. Reported by: peter, ps
# 87823	13-Dec-2001	peter	If we were called to allocate a vnode that is not associated with a mount point, do not dereference the NULL mp argument.
# 86037	04-Nov-2001	dillon	Add mnt_reservedvnlist so we can MFC to 4.x, in order to make all mount structure changes now rather than piecemeal later on. mnt_nvnodelist currently holds all the vnodes under the mount point. This will eventually be split into a 'dirty' and 'clean' list. This way we only break kld's once rather than twice. nvnodelist will eventually turn into the dirty list and should remain compatible with the klds.
# 85876	02-Nov-2001	rwatson	Merge from POSIX.1e Capabilities development tree: o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based on CAP_DAC_READ_SEARCH, but of !VDIR based on CAP_DAC_EXECUTE. Add appropriate conditionals to vaccess() to take that into account. o Synchronization cap_check_xxx() -> cap_check() change. Obtained from: TrustedBSD Project
# 85606	27-Oct-2001	dillon	syncdelay, filedelay, dirdelay, metadelay are ints, not time_t's, and can also be made static.
# 85517	26-Oct-2001	dillon	Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a real effect. Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%. Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%. (more optimization work is needed on top of these fixes) MFC after: 1 week
# 85515	25-Oct-2001	dillon	Add missing TAILQ_INSERT_TAIL's which somehow didn't get committed with the recent vnode cleanup.
# 85339	23-Oct-2001	dillon	Change the vnode list under the mount point from a LIST to a TAILQ in preparation for an implementation of limiting code for kern.maxvnodes. MFC after: 3 days
# 85036	16-Oct-2001	dillon	fix minor bug in kern.minvnodes sysctl. Use OID_AUTO.
# 84675	08-Oct-2001	dillon	WS Cleanup
# 84561	05-Oct-2001	dillon	vinvalbuf() was only waiting for write-I/O to complete. It really has to wait for both read AND write I/O to complete. Only NFS calls vinvalbuf() on an active vnode (when the server indicates that the file is stale), so this bug fix only affects NFS clients. MFC after: 3 days
# 84249	01-Oct-2001	dillon	After extensive testing it has been determined that adding complexity to avoid removing higher level directory vnodes from the namecache has no perceivable effect and will be removed. This is especially true when vmiodirenable is turned on, which it is by default now. ( vmiodirenable makes a huge difference in directory caching ). The vfs.vmiodirenable and vfs.nameileafonly sysctls have been left in to allow further testing, but I expect to rip out vfs.nameileafonly soon too. I have also determined through testing that the real problem with numvnodes getting too large is due to the VM Page cache preventing the vnode from being reclaimed. The directory stuff made only a tiny dent relative to Poul's original code, enough so that some tests succeeded. But tests with several million small files show that the bigger problem is the VM Page cache. This will have to be addressed by a future commit. MFC after: 3 days
# 83366	12-Sep-2001	julian	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 82395	27-Aug-2001	peter	If a file has been completely unlinked, stop automatically syncing the file. ffs will discard any pending dirty pages when it is closed, so we may as well not waste time trying to clean them. This doesn't stop other things from writing it out, eg: pageout, fsync(2) etc.
# 80448	27-Jul-2001	peter	Revert previous accidental commit. FWIW, it was part of enabling VM caching of disks through mmap() and stopping syncing of open files that had their last reference in the fs removed (ie: their unsync'ed pages get discarded on close already, so I made it stop syncing too).
# 80447	27-Jul-2001	peter	Fix cut/paste blunder. Serves me right for doing a last minute tweak to what I had for some time. Submitted by: bde
# 79224	04-Jul-2001	dillon	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
# 78909	28-Jun-2001	jhb	- Fix a mntvnode and vnode interlock reversal. - Protect the mnt_vnode list with the mntvnode lock.
# 76827	19-May-2001	alfred	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb
# 76688	16-May-2001	iedowse	Change the second argument of vflush() to an integer that specifies the number of references on the filesystem root vnode to be both expected and released. Many filesystems hold an extra reference on the filesystem root vnode, which must be accounted for when determining if the filesystem is busy and then released if it isn't busy. The old `skipvp' approach required individual filesystem xxx_unmount functions to re-implement much of vflush()'s logic to deal with the root vnode. All 9 filesystems that hold an extra reference on the root vnode got the logic wrong in the case of forced unmounts, so `umount -f' would always fail if there were any extra root vnode references. Fix this issue centrally in vflush(), now that we can. This commit also fixes a vnode reference leak in devfs, which could result in idle devfs filesystems that refuse to unmount. Reviewed by: phk, bp
# 76486	11-May-2001	iedowse	In vrele() and vput(), avoid triggering the confusing "missed vn_close" KASSERT when vp->v_usecount is zero or negative. In this case, the "v*: negative ref cnt" panic that follows is much more appropriate. Reviewed by: mckusick
# 76051	26-Apr-2001	phk	vfs_subr.c is getting rather fat. The underlying repocopy and this commit moves the filesystem export handling code to vfs_export.c
# 75934	25-Apr-2001	phk	Move the netexport structure from the fs-specific mountstructure to struct mount. This makes the "struct netexport *" parameter to the vfs_export and vfs_checkexport interface unneeded. Consequently, all non-stacking filesystems can use vfs_stdcheckexp(). At the same time, make it a pointer to a struct netexport in struct mount, so that we can remove the bogus AF_MAX and #include <net/radix.h> from <sys/mount.h>
# 75858	23-Apr-2001	grog	Correct #includes to work with fixed sys/mount.h.
# 75654	18-Apr-2001	tanimura	Reclaim directory vnodes held in namecache if few free vnodes are available. Only directory vnodes holding no child directory vnodes held in v_cache_src are recycled, so that directory vnodes near the root of the filesystem hierarchy remain in namecache and directory vnodes are not reclaimed in cascade. The period of vnode reclaiming attempt and the number of vnodes attempted to reclaim can be tuned via sysctl(2). Suggested by: tegge Approved by: phk
# 75580	17-Apr-2001	phk	This patch removes the VOP_BWRITE() vector. VOP_BWRITE() was a hack which made it possible for NFS client side to use struct buf with non-bio backing. This patch takes a more general approach and adds a bp->b_op vector where more methods can be added. The success of this patch depends on bp->b_op being initialized in all relevant places for some value of "relevant" which is not easy to determine. For now the buffers have grown a b_magic element which will make such issues a tiny bit easier to debug.
# 72956	23-Feb-2001	jlemon	Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean(). Use this to tell a filter attached to a vnode that the underlying vnode is no longer valid, by returning EV_EOF. PR: kern/25309, kern/25206
# 72650	18-Feb-2001	green	Switch to using a struct xucred instead of a struct ucred when not actually in the kernel. This structure is a different size than what is currently in -CURRENT, but should hopefully be the last time any application breakage is caused there. As soon as any major inconveniences are removed, the definition of the in-kernel struct ucred should be conditionalized upon defined(_KERNEL). This also changes struct export_args to remove dependency on the constantly-changing struct ucred, as well as limiting the bounds of the size fields to the correct size. This means: a) mountd and friends won't break all the time, b) mountd and friends won't crash the kernel all the time if they don't know what they're doing wrt actual struct export_args layout. Reviewed by: bde
# 72200	09-Feb-2001	bmilekic	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarly, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# 71999	04-Feb-2001	phk	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
# 71860	31-Jan-2001	bp	Properly lock new vnode. Reminded by: tegge
# 71576	24-Jan-2001	jasone	Convert all simplelocks to mutexes and remove the simplelock implementations.
# 71411	23-Jan-2001	rwatson	o The move to using VADMIN under vaccess() resulted in some system calls returning EACCES instead of EPERM. This patch modifies vaccess() to return EPERM instead of EACCES if VADMIN is among the requested rights. This affects functions normally limited to the owners of a file, such as chmod(), as EPERM is the error indicating that privilege would allow the operation, rather than a change in mandatory or discretionary rights. Reported by: bde
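The error-selection rule from r71411 can be sketched in a few lines of userland C. The `EX_` flag values and the `ex_access_check()` helper are hypothetical and only model the rule stated in the commit message; they are not the kernel's vaccess() implementation.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical right bits for illustration only; not the kernel's. */
#define EX_VWRITE	0x0080
#define EX_VREAD	0x0100
#define EX_VADMIN	0x1000

/*
 * Return 0 if every requested right is granted. Otherwise return
 * EPERM when the administrative right was among those requested
 * (privilege would allow the operation), and EACCES for all other
 * denials.
 */
int
ex_access_check(int requested, int granted)
{
	/* Sketch assumption: the caller asks for at least one right. */
	assert(requested != 0);

	if ((requested & granted) == requested)
		return (0);
	return ((requested & EX_VADMIN) ? EPERM : EACCES);
}
```

With this rule an owner-only operation such as chmod() attempted by a non-owner fails with EPERM, while an ordinary read or write denial still fails with EACCES.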
# 70063	15-Dec-2000	jhb	Stick the kthread API in a kthread_* namespace, and the specialized kproc functions in a kproc_* namespace. Reviewed by: -arch
# 69950	13-Dec-2000	mckusick	Use proper mutex locking when calling setrunnable from speedup_syncer(). Submitted by: Tor.Egge@fast.no
# 69781	08-Dec-2000	dwmalone	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
# 69664	06-Dec-2000	peter	Untangle vfsinit() a bit. Use separate sysinit functions rather than having a super-function calling bits all over the place.
# 69529	02-Dec-2000	gallatin	Correct int/long type mismatch in the proper place this time. freevnodes and numvnodes are longs in the kernel. They should remain longs in systat, what really needs to change is that they should be using SYSCTL_LONG rather than SYSCTL_INT. I also changed wantfreevnodes to SYSCTL_LONG because I happened to notice it. I wish there was a way to find all of these automatically.. Pointed out by: bde
# 69436	01-Dec-2000	jhb	Use msleep() instead of mtx_exit()/tsleep() so that we release the lock and go to sleep as an "atomic" operation.
# 69400	30-Nov-2000	mckusick	Get rid of a bogus mtx_exit (it was attempting to release an already released mutex). Submitted by: "Chris Knight" <chris@aims.com.au>
# 68885	18-Nov-2000	dillon	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather than block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmetic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. 
There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather than continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
# 68262	02-Nov-2000	tegge	Clear the VFREE flag when the vnode is removed from the free list in getnewvnode(). Otherwise routines called from VOP_INACTIVE() might attempt to remove the vnode from a free list the vnode isn't on, causing corruption. PR: 18012
# 68259	02-Nov-2000	phk	Take VBLK devices further out of their misery. This should fix the panic I introduced in my previous commit on this topic.
# 67365	20-Oct-2000	jhb	Catch up to moving headers: - machine/ipl.h -> sys/ipl.h - machine/mutex.h -> sys/mutex.h
# 67309	19-Oct-2000	rwatson	o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform "administrative" authorization checks. In most cases, the VADMIN test checks to make sure the credential effective uid is the same as the file owner. o Modify vaccess() to set VADMIN as an available right if the uid is appropriate. o Modify references to uid-based access control operations such that they now always invoke VOP_ACCESS() instead of using hard-coded policy checks. o This allows alternative UFS policies to be implemented by replacing only ufs_access() (such as mandatory system policies). o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the vnode: I believe that new invocations of VOP_ACCESS() are always called with the lock held. o Some direct checks of the uid remain, largely associated with the QUOTA and SUIDDIR code. Reviewed by: eivind Obtained from: TrustedBSD Project
# 66886	09-Oct-2000	eivind	Blow away the v_specmountpoint define, replacing it with what it was defined as (rdev->si_mountpoint)
# 66720	06-Oct-2000	jasone	Do not call lockdestroy() for v_vnlock, which may point to a lock in a deeper vfs stacking layer. Submitted by: bp
# 66686	05-Oct-2000	eivind	Style fixes based on comments by bde
# 66615	04-Oct-2000	jasone	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
# 66541	02-Oct-2000	bp	Move the KASSERTs which check the value of v_usecount after vnode locking, so they will not produce false alarms.
# 66411	27-Sep-2000	mckusick	Do the right thing if bdevvp is called twice for the same device. Obtained from: Poul-Henning Kamp <phk@freebsd.org>
# 66355	25-Sep-2000	bp	Add a lock structure to vnode structure. Previously it was either allocated separately (nfs, cd9660 etc) or kept as the first element of the structure referenced by the v_data pointer (ffs). Such organization leads to known problems with stacked filesystems. From this point vop_nolock() functions maintain only the interlock. vop_stdlock() functions maintain the built-in v_lock structure using lockmgr(). vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared lock on the vnode. If a filesystem wishes to export a lockmgr compatible lock, it can put the address of this lock in the v_vnlock field. This indicates that the upper filesystem can take advantage of it and use a single lock structure for the entire (or part) of the stack of vnodes. This field shouldn't be examined or modified by VFS code except for initialization purposes. Reviewed in general by: mckusick
# 66244	22-Sep-2000	eivind	Style fixes: * Add lots of comments * Convert a couple of assertions to KASSERT() * Minimal whitespace & misapplied {} fixes * Convert #if 0 to #if COMPILING_LINT for code we presently do not support, but want to keep available. Reviewed by: adrian, markm
# 66242	22-Sep-2000	eivind	Staticize addalias()
# 66168	21-Sep-2000	alfred	comment vfs_export functions, requested by: eivind
# 66130	20-Sep-2000	rwatson	o Add additional comment describing vaccess() behavior. Requested by: eivind Reviewed by: eivind, adrian
# 66067	19-Sep-2000	phk	Rename lminor() to dev2unit(). This function gives a linear unit number which hides the 'hole' in the minor bits. Introduce unit2minor() to do the reverse operation. Fix some make_dev() calls which didn't use UID_* or GID_* macros. Kill the v_hashchain alias macro, it hides the real relationship. Introduce experimental SI_CHEAPCLONE flag, set it on cloned bpfs.
# 65770	12-Sep-2000	bp	Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT. They will be used by nullfs and other stacked filesystems to support full cache coherency. Reviewed in general by: mckusick, dillon
# 65557	07-Sep-2000	jasone	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) * Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
# 65516	06-Sep-2000	rwatson	o Synchronize vaccess() capability access control checks with TrustedBSD tree. Obtained from: TrustedBSD Project
# 65492	05-Sep-2000	phk	Move extern declaration of dead_vnodeop_p to a .h file. Remove race condition in vn_isdisk().
# 65200	29-Aug-2000	rwatson	o Restructure vaccess() so as to check for DAC permission to modify the object before falling back on privilege. Make vaccess() accept an additional optional argument, privused, to determine whether privilege was required for vaccess() to return 0. Add commented out capability checks for reference. Rename some variables to make it more clear which modes/uids/etc are associated with the object, and which with the access mode. o Update file system use of vaccess() to pass NULL as the optional privused argument. Once additional patches are applied, suser() will no longer set ASU, so privused will permit passing of privilege information up the stack to the caller. Reviewed by: bde, green, phk, -security, others Obtained from: TrustedBSD Project
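The ordering described in r65200 (discretionary owner/group/other bits first, privilege only as a fallback, with an optional privused out-parameter) can be sketched in user space roughly as follows. This is an illustration only: the name `sketch_vaccess`, the `SK_*` constants, and the rule "uid 0 is privileged" are assumptions made for the example, not the kernel's actual vaccess() signature or policy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical access bits for this sketch (not the kernel's VREAD/VWRITE/VEXEC). */
#define SK_EXEC  01
#define SK_WRITE 02
#define SK_READ  04

/*
 * Return 0 if every bit in 'want' is permitted, EACCES otherwise.
 * The DAC bits of the first matching class (owner, group, other) are
 * checked first; privilege (assumed here to mean uid 0) is a fallback.
 * If privused is non-NULL, it is set to 1 only when privilege was needed.
 */
static int
sketch_vaccess(mode_t file_mode, uid_t file_uid, gid_t file_gid,
    uid_t cred_uid, const gid_t *groups, size_t ngroups,
    int want, int *privused)
{
	int granted, in_group = 0;
	size_t i;

	assert(groups != NULL || ngroups == 0);
	if (privused != NULL)
		*privused = 0;

	if (cred_uid == file_uid) {
		granted = (int)((file_mode >> 6) & 07);   /* owner bits */
	} else {
		for (i = 0; i < ngroups; i++) {
			if (groups[i] == file_gid) {
				in_group = 1;
				break;
			}
		}
		granted = in_group ?
		    (int)((file_mode >> 3) & 07) :        /* group bits */
		    (int)(file_mode & 07);                /* other bits */
	}
	if ((granted & want) == want)
		return (0);                               /* DAC alone sufficed */

	if (cred_uid == 0) {                              /* assumed privilege rule */
		if (privused != NULL)
			*privused = 1;
		return (0);
	}
	return (EACCES);
}
```

A caller that does not care whether privilege was exercised passes NULL for the last argument, which mirrors the note in the commit message about file systems passing NULL as privused.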
# 64875	20-Aug-2000	phk	Fix typo in last commit.
# 64865	20-Aug-2000	phk	Centralize the canonical vop_access user/group/other check in vaccess(). Discussed with: bde
# 63788	24-Jul-2000	mckusick	This patch corrects the first round of panics and hangs reported with the new snapshot code. Update addaliasu to correctly implement the semantics of the old checkalias function. When a device vnode first comes into existence, check to see if an anonymous vnode for the same device was created at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than creating a new vnode for the device. This corrects a problem which caused the kernel to panic when taking a snapshot of the root filesystem. Change the calling convention of vn_write_suspend_wait() to be the same as vn_start_write(). Split out softdep_flushworklist() from softdep_flushfiles() so that it can be used to clear the work queue when suspending filesystem operations. Access to buffers becomes recursive so that snapshots can recursively traverse their indirect blocks using ffs_copyonwrite() when checking for the need for copy on write when flushing one of their own indirect blocks. This eliminates a deadlock between the syncer daemon and a process taking a snapshot. Ensure that softdep_process_worklist() can never block because of a snapshot being taken. This eliminates a problem with buffer starvation. Cleanup change in ffs_sync() which did not synchronously wait when MNT_WAIT was specified. The result was an unclean filesystem panic when doing forcible unmount with heavy filesystem I/O in progress. Return a zero'ed block when reading a block that was not in use at the time that a snapshot was taken. Normally, these blocks should never be read. However, the readahead code will occasionally read them which can cause unexpected behavior. Clean up the debugging code that ensures that no blocks be written on a filesystem while it is suspended. Snapshots must explicitly label the blocks that they are writing during the suspension so that they do not cause a `write on suspended filesystem' panic.
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to prevent a race condition that would permit the same block to be copied twice. This change eliminates an unexpected soft updates inconsistency in fsck caused by the double allocation. Use bqrelse rather than brelse for buffers that will be needed soon again by the snapshot code. This improves snapshot performance.
# 62976	11-Jul-2000	mckusick	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot, so for brevity and timeliness they are not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
# 62776	07-Jul-2000	bp	Fix support for more than 256 simultaneous mounts. Theoretical limit is 2^16 mounts per fs type. Reported by: Troy Arie Cobb <tcobb@staff.circle.net> via phk Reviewed by: bde
# 62573	04-Jul-2000	phk	Previous commit changing SYSCTL_HANDLER_ARGS violated KNF. Pointed out by: bde
# 62552	04-Jul-2000	mckusick	Simplify and rationalise the management of the vnode free list (preparing the code to add snapshots).
# 62549	04-Jul-2000	mckusick	If a buffer flush fails when trying to reclaim a vnode, it is too late to save the vnode, so just toss any remaining unwritten buffers rather than leaving them lying around to make trouble in the future.
# 62469	03-Jul-2000	phk	Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES, that is way cleaner than using the softupdates_stub stunt, which should be killed when convenient. Discussed with: mckusick
# 62454	03-Jul-2000	phk	Style police catches up with rev 1.26 of src/sys/sys/sysctl.h: Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our sources: -sysctl_vm_zone SYSCTL_HANDLER_ARGS +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
# 62148	27-Jun-2000	phk	Move prtactive to vfs from ufs. It is used all over the place.
# 61724	16-Jun-2000	phk	Virtualizes & untangles the bioops operations vector. Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
# 60938	26-May-2000	jake	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 60833	23-May-2000	jake	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# 60539	14-May-2000	asmodai	Fix the rootmount code for now. This function will probably be rewritten/renamed to devpp. Submitted by: Assar Westerlund <assar@sics.se> on -current Confirmed to work: Steinar Haug <sthaug@nethelp.no>, Manfred Antar <mantar@pacbell.net> Reviewed by: phk
# 60041	05-May-2000	phk	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bde's teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.h> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter
# 58349	20-Mar-2000	phk	Rename the existing BUF_STRATEGY() to DEV_STRATEGY() substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo) substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo) This patch is machine generated except for the ccd.c and buf.h parts.
# 58185	18-Mar-2000	chris	In vn_isdisk(), check whether vp->v_rdev is NULL. If it is, then return ENXIO (Device not configured). Without this, vn_isdisk() could (and did in the case of lstat() under fdesc) pass a NULL pointer to devsw(), which caused a page fault. Reviewed by: alfred
# 58132	16-Mar-2000	phk	Eliminate the undocumented, experimental, non-delivering and highly dangerous MAX_PERF option.
# 58059	14-Mar-2000	bde	Don't try so hard to make the lower 16 bits of fsids unique. It tended to recycle full fsids after only 16 mount/unmount's. This is probably too often for exported fsids. Now we recycle the full fsids only after 2^16 mount/umount's and only ensure uniqueness in the lower 16 bits if there have been <= 256 calls to vfs_getnewfsid() since the system started.
# 57931	12-Mar-2000	bde	Try harder to make the lower 16 bits of fsids unique. The vfs type number was packed very wastefully, giving perfect non-uniqueness in the lower 16 bits of fsids for filesystems with the same vfs type. This made linux_stat() return perfectly non-unique (broken) 16-bit st_dev's for nfs mount points, and effectively reduced mntid_base to 8 bits so that the vfs_getnewfsid() looped endlessly when there are already 256 mounted filesystems with the required vfs type. Approved by: jkh
# 57025	07-Feb-2000	sos	Do refcounting of open devices (more) correctly. count_dev function by phk.
# 56949	02-Feb-2000	rwatson	Remove static qualifier from vgonel, as it is needed by the Arla folk outside of vfs_subr.c. Submitted by: Assar Westerlund <assar@sics.se> Reviewed by: rwatson Approved by: jkh
# 56837	29-Jan-2000	rwatson	This patch fixes a locking bug that can result in deadlock if the codepath is followed. From the PR: vclean calls vrele leading to deadlock (if usecount > 0) vclean() calls vrele() if v_usecount of the node was higher than one. But before calling it, it sets the VXLOCK flag, which will make vn_lock called from vrele dead-lock. PR: kern/15117 Submitted by: Assar Westerlund <assar@stacken.kth.se> Reviewed by: rwatson Obtained from: NetBSD
# 55756	10-Jan-2000	phk	Give vn_isdisk() a second argument where it can return a suitable errno. Suggested by: bde
# 55695	10-Jan-2000	mckusick	Remove the P_BUFEXHAUST flag from the syncer process (leaving it only on the buf_daemon process). The problem is that when the syncer process starts running the worklist, it wants to delete lots of files. It does this by VFS_VGET'ing the vnodes, clearing the blocks in them and bdwrite'ing the buffer. It can process close to a thousand files per second which generates a large number of dirty buffers. So, giving it special privilege at the buffer trough leads to trouble as the buf_daemon does occasionally need a free buffer to proceed and if the syncer has used every last one up, we are toast.
# 55611	08-Jan-2000	eivind	Change NDFREE() from a macro to a function for the time being; the macro version caused intolerable bloat (30k). I'm likely to revisit this with an attempt at a smarter macro. Bloat noticed by: bde
# 55539	07-Jan-2000	luoqi	Introduce a mechanism to suspend/resume system processes. Suspend syncer and bufdaemon prior to disk sync during system shutdown.
# 55431	05-Jan-2000	dillon	Enhance reassignbuf(). When a buffer cannot be time-optimally inserted into vnode dirtyblkhd we append it to the list instead of prepending it to the list in order to maintain a 'forward' locality of reference, which is arguably better than 'reverse'. The original algorithm did things this way too, but at a huge time cost. Enhance the append interlock for NFS writes to handle intr/soft mounts better. Fix the hysteresis for NFS async daemon I/O requests to reduce the number of unnecessary context switches. Modify handling of NFS mount options. Any given user option that is too high now defaults to the kernel maximum for that option rather than the kernel default for that option. Reviewed by: Alfred Perlstein <bright@wintelcom.net>
# 54989	22-Dec-1999	mckusick	Prettyness police: Identify flags in b_xflags with BX_ to distinguish them from flags in b_flags which are prefixed with B_
# 54467	12-Dec-1999	dillon	Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to madvise(). This feature prevents the update daemon from gratuitously flushing dirty pages associated with a mapped file-backed region of memory. The system pager will still page the memory as necessary and the VM system will still be fully coherent with the filesystem. Modifications made by other means to the same area of memory, for example by write(), are unaffected. The feature works on a page-granularity basis. MAP_NOSYNC allows one to use mmap() to share memory between processes without incurring any significant filesystem overhead, putting it in the same performance category as SysV Shared memory and anonymous memory. Reviewed by: julian, alc, dg
# 54444	11-Dec-1999	eivind	Lock reporting and assertion changes. * lockstatus() and VOP_ISLOCKED() gets a new process argument and a new return value: LK_EXCLOTHER, when the lock is held exclusively by another process. * The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them * Extend the vnode_if.src format to allow more exact specification than locked/unlocked. This commit should not do any semantic changes unless you are using DEBUG_VFS_LOCKS. Discussed with: grog, mch, peter, phk Reviewed by: peter
# 53900	29-Nov-1999	dillon	Remove vfs_getrootfsid() function (a temporary hack added a few months ago to make BOOTP work again). It is no longer required by BOOTP and no longer used.
# 53577	22-Nov-1999	phk	Convert various pieces of code to use vn_isdisk() rather than checking for vp->v_type == VBLK. In ccd: we don't need to call VOP_GETATTR to find the type of a vnode. Reviewed by: sos
# 53452	20-Nov-1999	phk	struct mountlist and struct mount.mnt_list have no business being a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively. This removes ugly mp != (void*)&mountlist comparisons. Requested by: phk Submitted by: Jake Burkholder jake@checker.org PR: 14967
# 53225	16-Nov-1999	phk	Commit the remaining part of PR14914: A lot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
# 53059	09-Nov-1999	phk	Next step in the device cleanup process. Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code. Unify spec_open() for bdev and cdev cases. Remove the disabled bdev specific read/write code.
# 52635	29-Oct-1999	phk	useracc() the prequel: Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE} as argument.
# 52128	11-Oct-1999	peter	Trim unused options (or #ifdef for undoc options). Submitted by: phk
# 51926	04-Oct-1999	phk	Move the buffered read/write code out of spec_{read|write} and into two new functions spec_buf{read|write}. Add sysctl vfs.bdev_buffered which defaults to 1 == true. This sysctl can be used to experimentally turn buffered behaviour for bdevs off. It should not be changed while any block devices are open. Remove the misplaced sysctl vfs.enable_userblk_io. No other changes in behaviour.
# 51797	29-Sep-1999	phk	Remove v_maxio from struct vnode. Replace it with mnt_iosize_max in struct mount. Nits from: bde
# 51488	21-Sep-1999	dillon	Final commit to remove vnode->v_lastr. vm_fault now handles read clustering issues (replacing code that used to be in ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter inlines. This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr for VM, and fp->f_nextread and fp->f_seqcount (which have been in the tree for a while). Determination of the I/O strategy (sequential, random, and so forth) is now handled on a descriptor-by-descriptor basis for base I/O calls, and on a memory-region-by-memory-region and process-by-process basis for VM faults. Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
# 51478	20-Sep-1999	phk	Initialize vp->v_maxio to its default in getnewvnode() rather than four different places in vfs_cluster.c
# 51388	19-Sep-1999	dillon	Fix BOOTP root FS mounts. Also cleanup vfs_getnewfsid() and collapse addaliasu() into addalias() (no operational change) and clarify comments relating to a trick that vclean() uses. The fix to BOOTP is yet another hack. Actually, rootfsid handling is already a major hack. The whole thing needs to be cleaned up. Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
# 51345	17-Sep-1999	dillon	Add vfs.enable_userblk_io sysctl to control whether user reads and writes to buffered block devices are allowed. The default is to be backwards compatible, i.e. reads and writes are allowed. The idea is for a larger crowd to start running with this disabled and see what problems, if any, crop up, and then to change the default to off and see if any problems crop up in the next 6 months prior to potentially removing support entirely. There are still a few people, Julian and myself included, who believe the buffered block device access from usermode to be useful. Remove use of vnode->v_lastr from buffered block device I/O in preparation for removal of vnode->v_lastr field, replacing it with the already existing seqcount metric to detect sequential operation. Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
# 50549	29-Aug-1999	phk	Add dev_t freeing code. Controlled by sysctl debug.free_devt, default is off.
# 50521	28-Aug-1999	phk	remove unused variables.
# 50477	28-Aug-1999	peter	$Id$ -> $FreeBSD$
# 50405	26-Aug-1999	phk	Simplify the handling of VCHR and VBLK vnodes using the new dev_t: Make the alias list a SLIST. Drop the "fast recycling" optimization of vnodes (including the returning of a preexisting but stale vnode from checkalias). It doesn't buy us anything now that we don't hardlimit vnodes anymore. Rename checkalias2() and checkalias() to addalias() and addaliasu() - which takes dev_t and udev_t arg respectively. Make the revoke syscalls use vcount() instead of VALIASED. Remove VALIASED flag, we don't need it now and it is faster to traverse the much shorter lists than to maintain the flag. vfs_mountedon() can check the dev_t directly, all the vnodes point to the same one. Print the devicename in specfs/vprint(). Remove a couple of stale LFS vnode flags. Remove unimplemented/unused LK_DRAINED;
# 50347	25-Aug-1999	phk	Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.
# 50334	25-Aug-1999	julian	Make DEVFS use PHK's specinfo struct as the source of dev_t and devsw. In lookup() however it's the other way around as we need to supply the dev_t for the vnode, so devfs still has a copy of it stashed away. Sourcing it from the vnode in the vnops however is useful as it makes a lot of the code almost the same as that in specfs.
# 50137	22-Aug-1999	jdp	Support full-precision file timestamps. Until now, only the seconds have been maintained, and that is still the default. A new sysctl variable "vfs.timestamp_precision" can be used to enable higher levels of precision: 0 = seconds only; nanoseconds zeroed (default). 1 = seconds and nanoseconds, accurate within 1/HZ. 2 = seconds and nanoseconds, truncated to microseconds. >=3 = seconds and nanoseconds, maximum precision. Level 1 uses getnanotime(), which is fast but can be wrong by up to 1/HZ. Level 2 uses microtime(). It might be desirable for consistency with utimes() and friends, which take timeval structures rather than timespecs. Level 3 uses nanotime() for the highest precision. I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with "cpio -pdu". There was almost negligible difference in the system times -- much less than 1%, and less than the variation among multiple runs at the same level. Bruce Evans dreamed up a torture test involving 1-byte reads with intervening fstat() calls, but the cpio test seems more realistic to me. This feature is currently implemented only for the UFS (FFS and MFS) filesystems. But I think it should be easy to support it in the others as well. An earlier version of this was reviewed by Bruce. He's not to blame for any breakage I've introduced since then. Reviewed by: bde (an earlier version of the code)
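The effect of the four vfs.timestamp_precision levels on a stored timestamp can be sketched in user space as below. This is a hedged illustration, not the kernel code: `sketch_apply_precision` is a hypothetical helper, and it models only the truncation each level implies. Levels 1 and >=3 differ in which kernel clock is read (getnanotime() versus nanotime()), which a pure truncation function cannot show.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper: shape a timestamp according to the precision
 * level listed in the commit message.
 *   0   -> seconds only, nanoseconds zeroed
 *   1   -> nanoseconds kept (kernel would read the cheap, 1/HZ clock)
 *   2   -> nanoseconds truncated to whole microseconds
 *   >=3 -> nanoseconds kept (kernel would read the precise clock)
 * Treating negative levels like level 0 is an assumption of this sketch.
 */
static struct timespec
sketch_apply_precision(struct timespec ts, int level)
{
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);

	if (level <= 0)
		ts.tv_nsec = 0;
	else if (level == 2)
		ts.tv_nsec -= ts.tv_nsec % 1000;
	return (ts);
}
```

For example, a timestamp with 123456789 nanoseconds keeps that value at levels 1 and 3, becomes 123456000 at level 2, and becomes 0 at level 0.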
# 49679	13-Aug-1999	phk	The bdevsw() and cdevsw() are now identical, so kill the former.
# 49678	13-Aug-1999	phk	s/v_specinfo/v_rdev/
# 49535	08-Aug-1999	phk	Decommission miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>, a few lines into <sys/vnode.h>. Add a few fields to struct specinfo, paving the way for the fun part.
# 49101	26-Jul-1999	alc	Add sysctl and support code to allow directories to be VMIO'd. The default setting for the sysctl is OFF, which is the historical operation. Submitted by: dillon
# 48936	20-Jul-1999	phk	Now a dev_t is a pointer to struct specinfo which is shared by all specdev vnodes referencing this device. Details: cdevsw->d_parms has been removed, the specinfo is available now (== dev_t) and the driver should modify it directly when applicable, and the only driver doing so, does so: vn.c. I am not sure the logic in checking for "<" was right before, and it looks even less so now. An initial pool of 50 struct specinfo are depleted during early boot, after that malloc had better work. It is likely that fewer than 50 would do. Hashing is done from udev_t to dev_t with a prime number remainder hash, experiments show no better hash available for decent cost (MD5 is only marginally better) The prime number used should not be close to a power of two, we use 83 for now. Add new checkalias2() to get around the loss of info from dev2udev() in bdevvp(); The aliased vnodes are hung on a list straight off the dev_t, and speclisth[SPECSZ] is unused. The sharing of struct specinfo means that the v_specnext moves into the vnode which grows by 4 bytes. Don't use a VBLK dev_t which doesn't make sense in MFS, now we hang a dummy cdevsw on B/Cmaj 253 so that things look sane. Storage overhead from all of this is O(50k). Bump __FreeBSD_version to 400009 The next step will add the stuff needed so device-drivers can start to hang things from struct specinfo
# 48892	19-Jul-1999	phk	[click] Now all dev_t's in the kernel have their char device major. The only known casualty of this is swapinfo/pstat, which should be fixed the right way: Store the actual pathname in the kernel like mount does. [Volunteers sought for this task] The road map from here is roughly: expand struct specinfo into struct based dev_t. Add dev_t registration facilities for device drivers and start to use them.
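The "prime number remainder hash" mentioned in r48936 amounts to taking the device number modulo a prime. A minimal sketch follows, assuming a 32-bit device number; the names `SKETCH_DEVT_HASH` and `sketch_devt_bucket` are hypothetical and only the value 83 comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DEVT_HASH 83u	/* prime quoted in the commit message */

/*
 * Map a user-visible device number to a hash bucket by prime remainder.
 * Device numbers pack major and minor into power-of-two bit fields, so a
 * modulus that is not near a power of two spreads values that differ
 * only in one of those fields across different buckets.
 */
static unsigned
sketch_devt_bucket(uint32_t udev)
{
	unsigned bucket = (unsigned)(udev % SKETCH_DEVT_HASH);

	assert(bucket < SKETCH_DEVT_HASH);
	return (bucket);
}
```

With this modulus, the values 0, 256 and 512, which would all collide under a power-of-two table of 256 entries, land in buckets 0, 7 and 14.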
# 48884	18-Jul-1999	phk	Introduce the vn_todev(struct vnode*) function, which returns the dev_t corresponding to a VBLK or VCHR node, or NODEV.
# 48863	17-Jul-1999	phk	Fix 2nd arg to udev2dev().
# 48859	17-Jul-1999	phk	I have not one single time remembered the name of this function correctly so obviously I gave it the wrong name. s/umakedev/makeudev/g
# 48777	12-Jul-1999	kris	Correct a couple of spelling errors in comments.
# 48677	08-Jul-1999	mckusick	These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bringing the machine to a grinding halt. * buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems. * minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propagate to deeper inlines and should produce better code. * removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely. * removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal. * buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c * VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system. * removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers. * slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers. * addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation. * Removal of the newbufcnt and lastnewbuf counters that Kirk added.
They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ). * removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ). * removal of extra arguments passed to getnewbuf() that are not necessary. * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c * vn_write() now calls bwillwrite() PRIOR to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently. * removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon. Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 48544	04-Jul-1999	mckusick	The buffer queue mechanism has been reformulated. Instead of having QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN, QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean and dirty buffers have been separated. Empty buffers with KVM assignments have been separated from truely empty buffers. getnewbuf() has been rewritten and now operates in a 100% optimal fashion. That is, it is able to find precisely the right kind of buffer it needs to allocate a new buffer, defragment KVM, or to free-up an existing buffer when the buffer cache is full (which is a steady-state situation for the buffer cache). Buffer flushing has been reorganized. Previously buffers were flushed in the context of whatever process hit the conditions forcing buffer flushing to occur. This resulted in processes blocking on conditions unrelated to what they were doing. This also resulted in inappropriate VFS stacking chains due to multiple processes getting stuck trying to flush dirty buffers or due to a single process getting into a situation where it might attempt to flush buffers recursively - a situation that was only partially fixed in prior commits. We have added a new daemon called the buf_daemon which is responsible for flushing dirty buffers when the number of dirty buffers exceeds the vfs.hidirtybuffers limit. This daemon attempts to dynamically adjust the rate at which dirty buffers are flushed such that getnewbuf() calls (almost) never block. The number of nbufs and amount of buffer space is now scaled past the 8MB limit that was previously imposed for systems with over 64MB of memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed somewhat. The number of physical buffers has been increased with the intention that we will manage physical I/O differently in the future. 
reassignbuf previously attempted to keep the dirtyblkhd list sorted which could result in non-deterministic operation under certain conditions, such as when a large number of dirty buffers are being managed. This algorithm has been changed. reassignbuf now keeps buffers locally sorted if it can do so cheaply, and otherwise gives up and adds buffers to the head of the dirtyblkhd list. The new algorithm is deterministic but not perfect. The new algorithm greatly reduces problems that previously occurred when write_behind was turned off in the system. The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive P_BUFEXHAUST bit. This bit allows processes working with filesystem buffers to use available emergency reserves. Normal processes do not set this bit and are not allowed to dig into emergency reserves. The purpose of this bit is to avoid low-memory deadlocks. A small race condition was fixed in getpbuf() in vm/vm_pager.c. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 48468	02-Jul-1999	phk	Make sure that stat(2) and friends always return a valid st_dev field. Pseudo-FS need not fill in the va_fsid anymore, the syscall code will use the first half of the fsid, which now looks like a udev_t with major 255.
# 48391	01-Jul-1999	peter	Slight reorganization of kernel thread/process creation. Instead of using SYSINIT_KT() etc (which is a static, compile-time procedure), use a NetBSD-style kthread_create() interface. kproc_start is still available as a SYSINIT() hook. This allowed simplification of chunks of the sysinit code in the process. This kthread_create() is our old kproc_start internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work the same as the NetBSD one. One thing I'd like to do shortly is get rid of nfsiod as a user initiated process. It makes sense for the nfs client code to create them on the fly as needed up to a user settable limit. This means that nfsiod doesn't need to be in /sbin and is always "available". This is a fair bit easier to do outside of the SYSINIT_KT() framework.
# 48225	26-Jun-1999	mckusick	Convert buffer locking from using the B_BUSY and B_WANTED flags to using lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
# 47964	16-Jun-1999	mckusick	Add a vnode argument to VOP_BWRITE to get rid of the last vnode operator special case. Delete special case code from vnode_if.sh, vnode_if.src, umap_vnops.c, and null_vnops.c.
# 47940	15-Jun-1999	mckusick	Get rid of the global variable rushjob and replace it with a function in kern/vfs_subr.c named speedup_syncer() which handles the speedup request. Change the various clients of rushjob to use the new function.
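The idea behind r47940, hiding a shared counter behind a single entry point, can be sketched in userland C. This is not the kernel code: `rushjob` and `speedup_syncer()` are named in the commit message, but the `syncdelay` value, the cap of `syncdelay / 2`, the return convention, and the `pending_rush_requests()` accessor are assumptions made for illustration.

```c
/* Assumed tunable: length of the syncer's delay window, in seconds. */
static int syncdelay = 30;

/* Formerly a global written by every client; now private to this file. */
static int rushjob;

/*
 * Ask the syncer to run faster. Returns 1 if the request was accepted,
 * 0 if the syncer is already being rushed as hard as this sketch allows.
 */
int
speedup_syncer(void)
{
	if (rushjob < syncdelay / 2) {
		rushjob++;
		return (1);
	}
	return (0);
}

/* Hypothetical read-only accessor so callers cannot modify the counter. */
int
pending_rush_requests(void)
{
	return (rushjob);
}
```

With the counter private, any policy such as the cap lives in one place instead of being repeated (or forgotten) at each call site.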
# 47640	31-May-1999	phk	Simplify cdevsw registration. The cdevsw_add() function now finds the major number(s) in the struct cdevsw passed to it. cdevsw_add_generic() is no longer needed, cdevsw_add() does the same thing. cdevsw_add() will print a message if the d_maj field looks bogus. Remove nblkdev and nchrdev variables. Most places they were used bogusly. Instead check a dev_t for validity by seeing if devsw() or bdevsw() returns NULL. Move bdevsw() and devsw() functions to kern/kern_conf.c Bump __FreeBSD_version to 400006 This commit removes: 72 bogus makedev() calls 26 bogus SYSINIT functions if_xe.c bogusly accessed cdevsw[], author/maintainer please fix. I4b and vinum not changed. Patches emailed to authors. LINT probably broken until they catch up.
# 47445	24-May-1999	jb	Remove the test for bdevsw(dev) == NULL from bdevvp() because it fails if there is no character device associated with the block device. In this case that doesn't matter because bdevvp() doesn't use the character device structure. I can use the pointy bit of the axe too.
# 47202	14-May-1999	luoqi	Legally acquire a major number for mfs.
# 47132	14-May-1999	mckusick	Previously directories were sync'ed every 10 seconds while bitmaps & inodes were synced every 15 seconds. This is now reversed as during directory create, we cannot commit the directory entry until its inode has been written. With this switch, the inodes will be more likely to be written by the time that the directory is written thus reducing the number of directory rollbacks that are needed.
# 47075	12-May-1999	peter	Fix (?) SPECHASH dev_t/major/minor/etc args
# 47065	12-May-1999	phk	Don't peek into dev_t
# 47028	11-May-1999	phk	Divorce "dev_t" from the "major\|minor" bitmap, which is now called udev_t in the kernel but still called dev_t in userland. Provide functions to manipulate both types: major() umajor() minor() uminor() makedev() umakedev() dev2udev() udev2dev() For now they're functions, they will become in-line functions after one of the next two steps in this process. Return major/minor/makedev to macro-hood for userland. Register a name in cdevsw[] for the "filedescriptor" driver. In the kernel the udev_t appears in places where we have the major/minor number combination, (ie: a potential device: we may not have the driver nor the device), like in inodes, vattr, cdevsw registration and so on, whereas the dev_t appears where we carry around a reference to a actual device. In the future the cdevsw and the aliased-from vnode will be hung directly from the dev_t, along with up to two softc pointers for the device driver and a few houskeeping bits. This will essentially replace the current "alias" check code (same buck, bigger bang). A little stunt has been provided to try to catch places where the wrong type is being used (dev_t vs udev_t), if you see something not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if it makes a difference. If it does, please try to track it down (many hands make light work) or at least try to reproduce it as simply as possible, and describe how to do that. Without DEVT_FASCIST I belive this patch is a no-op. Stylistic/posixoid comments about the userland view of the <sys/*.h> files welcome now, from userland they now contain the end result. Next planned step: make all dev_t's refer to the same devsw[] which means convert BLK's to CHR's at the perimeter of the vnodes and other places where they enter the game (bootdev, mknod, sysctl).
# 46679	08-May-1999	phk	Fix some of the places where too much inside knowledge about major/minor layout and dev_t structure is being (ab)used.
# 46676	08-May-1999	phk	I got tired of seeing all the cdevsw[major(foo)] all over the place. Made a new (inline) function devsw(dev_t dev) and substituted it. Changed to the BDEV variant to this format as well: bdevsw(dev_t dev) DEVFS will eventually benefit from this change too.
# 46635	07-May-1999	phk	Continue where Julian left off in July 1998: Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline) function. Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention to the order of the cmaj/bmaj arguments!) Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE (ditto!) (Next step will be to convert all bdev dev_t's to cdev dev_t's before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
# 46381	03-May-1999	billf	Add sysctl descriptions to many SYSCTL_XXXs PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
# 44679	12-Mar-1999	julian	Reviewed by: Many at different times in different parts, including alan, john, me, luoqi, and kirk Submitted by: Matt Dillon <dillon@frebsd.org> This change implements a relatively sophisticated fix to getnewbuf(). There were two problems with getnewbuf(). First, the write recursion can lead to a system stack overflow when you have NFS and/or VN devices in the system. Second, the free/dirty buffer accounting was completely broken. Not only did the nfs routines blow it trying to manually account for the buffer state, but the accounting that was done did not work well with the purpose of their existence: figuring out when getnewbuf() needs to sleep. The meat of the change is to kern/vfs_bio.c. The remaining diffs are all minor except for NFS, which includes both the fixes for bp interaction AND fixes for a 'biodone(): buffer already done' lockup. Sys/buf.h also contains a chaining structure which is not used by this patchset but is used by other patches that are coming soon. This patch is delineated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF. (sorry for the missing T matt)
# 44247	25-Feb-1999	dillon	Reviewed by: Julian Elischer <julian@whistle.com> Add d_parms() to {c,b}devsw[]. If non-NULL this function points to a device routine that will properly fill in the specinfo structure. vfs_subr.c's checkalias() supplies appropriate defaults. This change should be fully backwards compatible with existing devices.
# 44150	19-Feb-1999	dillon	Protect vn worklist and vn->v_{clean,dirty}blkhd at splbio(). Get rid of extra LIST_REMOVE() Reviewed by: hsu@FreeBSD.ORG (Jeffrey Hsu), mckusick@McKusick.COM Submitted by: hsu@FreeBSD.ORG (Jeffrey Hsu), dillon@backplane.com ( Matthew Dillon )
# 43618	04-Feb-1999	dillon	vp->v_object must be valid after normal flow of vfs_object_create() completes, change if() to KASSERT(). This is not a bug, we are simply clarifying and optimizing the code. In if/else in vfs_object_create(), the failure of both conditionals will lead to a NULL object. Exit gracefully if this case occurs. ( this case does not normally occur, but needed to be handled ). Obtained from: Eivind Eklund <eivind@FreeBSD.org>
# 43403	29-Jan-1999	dillon	More const fixes for -Wall, -Wcast-qual
# 43311	28-Jan-1999	dillon	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 42957	21-Jan-1999	dillon	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
# 42453	10-Jan-1999	eivind	KNFize, by bde.
# 42408	08-Jan-1999	eivind	Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers. Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic. Reviewed by: msmith
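The `KASSERT(assertion, ("panic message", args))` form introduced in r42408 relies on the message being passed as a parenthesized argument list, which the macro pastes directly after the panic function's name. A minimal userland sketch, assuming a hypothetical `sketch_panic()` stand-in for the kernel's `panic()` and a hypothetical caller `checked_div()`; the exact kernel definition may differ:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#define INVARIANTS 1	/* assumed: checks are enabled in this sketch */

/* Hypothetical userland stand-in for the kernel's panic(). */
static void
sketch_panic(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
	abort();
}

#ifdef INVARIANTS
/* msg is a parenthesized printf-style argument list. */
#define KASSERT(exp, msg) do {			\
	if (!(exp))				\
		sketch_panic msg;		\
} while (0)
#else
/* Without INVARIANTS the check compiles away entirely. */
#define KASSERT(exp, msg) do { } while (0)
#endif

/* Hypothetical caller: divide, asserting the precondition first. */
static int
checked_div(int num, int den)
{
	KASSERT(den != 0, ("checked_div: zero divisor (num=%d)", num));
	return (num / den);
}
```

The double parentheses let a pre-C99 macro accept a variable number of format arguments without variadic macro support.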
# 42315	05-Jan-1999	eivind	Remove the 'waslocked' parameter to vfs_object_create().
# 42313	05-Jan-1999	eivind	Finish staticization.
# 42248	02-Jan-1999	bde	Ifdefed conditionally used simplock variables.
# 42043	24-Dec-1998	bde	Restored rev.1.31 which was clobbered by rev.1.69 (the big Lite2 merge). This fixes at least hanging in revoke(2) when a somewhat active slave pty is revoked. The hang made the window for the null pointer bug in ufsspec_{read,write} much larger. There are many other bugs in this area (revoke of an active fifo at best leaks memory...).
# 41995	22-Dec-1998	eivind	Check return value of tsleep(). I've checked all call points - there does not seem to be a problem with this. PR: kern/8732 Analysis by: David G Andersen <danderse@cs.utah.edu> Tested by: Alfred Perlstein <bright@hotjobs.com>
# 41994	21-Dec-1998	eivind	Staticize.
# 41514	04-Dec-1998	archie	Examine all occurrences of sprintf(), strcat(), and str[n]cpy() for possible buffer overflow problems. Replaced most sprintf()'s with snprintf(); for others cases, added terminating NUL bytes where appropriate, replaced constants like "16" with sizeof(), etc. These changes include several bug fixes, but most changes are for maintainability's sake. Any instance where it wasn't "immediately obvious" that a buffer overflow could not occur was made safer. Reviewed by: Bruce Evans <bde@zeta.org.au> Reviewed by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Mike Spengler <mks@networkcs.com>
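The kind of conversion r41514 describes, replacing an unbounded `sprintf()` with `snprintf()` sized by `sizeof()`, might look like the following. The function `format_unit_name()` is a hypothetical example, not code from the commit:

```c
#include <stdio.h>

/*
 * Format "<name><unit>" into buf, which holds len bytes.
 * Returns 0 on success, -1 if the result did not fit; in that case
 * buf still holds a NUL-terminated, truncated string (for len > 0).
 */
int
format_unit_name(char *buf, size_t len, const char *name, int unit)
{
	int n;

	/* The old style would have been: sprintf(buf, "%s%d", name, unit); */
	n = snprintf(buf, len, "%s%d", name, unit);
	if (n < 0 || (size_t)n >= len)
		return (-1);
	return (0);
}
```

A caller would pass `sizeof(buf)` rather than a literal such as 16, so the bound tracks the declaration if the buffer size ever changes.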
# 40787	31-Oct-1998	peter	Convert lists for bufs attached to vnodes from a LIST to a TAILQ. - Use TAILQ_* macros extensively instead of internal names - use b_xflags instead of the NOLIST magic number hack in the next pointer - clean bufs are inserted at the tail rather than the head. - redo dirty buffer insert so that metadata (negative lbn) goes to the tail directly rather than at the HEAD. This makes a difference when inserting dirty data blocks in lbn sorted order since data block insertion will not have to bypass all the metadata cruft. data is lbn sorted since it makes sense for clustering and writeback ordering, while metadata sorting doesn't help much since the lbn's are meaningless when walking the list for writebacks. Small systems will not notice much (if any) benefit from this, but really busy systems with large dirty block lists should get a lot more. I've tested this with softdep, and it doesn't seem to mind the change of queueing of metadata. Reviewed (in principle) by: dg Obtained from: partly from John Dyson's work-in-progress patches in June.
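The dirty-list ordering rule described in r40787 can be sketched with the `TAILQ_*` macros from `<sys/queue.h>` (assumed to be available, as it is on the BSDs and glibc). The stripped-down `struct buf` and the function `dirty_insert()` are hypothetical; the real code scans differently and tracks list membership in `b_xflags`:

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, stripped-down buffer header. */
struct buf {
	long			b_lblkno;	/* logical block; negative = metadata */
	TAILQ_ENTRY(buf)	b_vnbufs;	/* linkage on the vnode's dirty list */
};

TAILQ_HEAD(buflists, buf);

/*
 * Insert bp into the dirty list: metadata goes straight to the tail,
 * data blocks are kept in ascending b_lblkno order ahead of any metadata.
 */
void
dirty_insert(struct buflists *head, struct buf *bp)
{
	struct buf *scan;

	if (bp->b_lblkno < 0) {
		TAILQ_INSERT_TAIL(head, bp, b_vnbufs);
		return;
	}
	TAILQ_FOREACH(scan, head, b_vnbufs) {
		/* Stop at the first metadata buf or the first larger lbn. */
		if (scan->b_lblkno < 0 || scan->b_lblkno > bp->b_lblkno) {
			TAILQ_INSERT_BEFORE(scan, bp, b_vnbufs);
			return;
		}
	}
	TAILQ_INSERT_TAIL(head, bp, b_vnbufs);
}
```

Because metadata collects at the tail, the scan for a data block never has to step over more than one metadata entry before finding its insertion point.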
# 40777	31-Oct-1998	peter	The last argument to vm_object_page_clean() is now a set of bit flags, rather than the old true/false. While here, have vfs_msync() only call vm_object_page_clean() with OBJPC_SYNC if called with MNT_WAIT flags. vfs_msync() is called at unmount time (with MNT_WAIT) and from the syncer process (formerly update). This should make dirty mmap writebacks a little less nasty. I have tested this a little with SOFTUPDATES enabled, but I don't normally use it since I've been badly burned too many times.
# 40728	29-Oct-1998	bde	Oops, rev.1.167 made the device number checking in bdevvp() too strict for mfs root mounts. Don't require major 255 to be in bdevsw[].
# 40722	29-Oct-1998	peter	Remove the V_SAVEMETA flag, nothing uses it any more now that msdosfs and ext2fs call vtruncbuf() directly. This simplifies and cleans up vinvalbuf() a little.
# 40659	26-Oct-1998	bde	Updated the major number check in vfs_object_create(). It's not clear if the check is necessary, but vfs_object_create() is called for all vnodes and it was silly to create objects for VBLK vnodes that don't even have a driver.
# 40648	25-Oct-1998	phk	Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
# 40647	25-Oct-1998	bde	Fixed device number checking in bdevvp(): - dev != NODEV was checked for, but 0 was returned on failure. This was fixed in Lite2 (except the return code was still slightly wrong (ENODEV instead of ENXIO)) but the changes were not merged. This case probably doesn't actually occur under FreeBSD. - major(dev) was not checked to have a valid non-NULL bdevsw entry. This caused panics when the driver for the root device didn't exist. Fixed minor misformattings in bdevvp(). Rev.1.14 consisted mainly of gratuitous reformattings that seem to have caused many Lite2 merge errors. PR: 8417
# 40349	14-Oct-1998	dt	Backed out rev. 1.164. It caused problems on SMP. PR: 8309
# 40286	13-Oct-1998	dg	Fixed two potentially serious classes of bugs: 1) The vnode pager wasn't properly tracking the file size due to "size" being page rounded in some cases and not in others. This sometimes resulted in corrupted files. First noticed by Terry Lambert. Fixed by changing the "size" pager_alloc parameter to be a 64bit byte value (as opposed to a 32bit page index) and changing the pagers and their callers to deal with this properly. 2) Fixed a bogus type cast in round_page() and trunc_page() that caused some 64bit offsets and sizes to be scrambled. Removing the cast required adding casts at a few dozen callers. There may be problems with other bogus casts in close-by macros. A quick check seemed to indicate that those were okay, however.
# 40267	12-Oct-1998	dt	UnVMIO vnodes of block devices when they are no longer in use. (Some things, like msdosfs, do not work (panic) on devices with VMIO enabled. FFS enables VMIO on mounted devices, and nothing previously disabled it, so, after you mounted an FFS floppy, you could not mount an msdosfs floppy anymore...) This is mostly a quick before-release fix. Reviewed by: bde
# 39187	14-Sep-1998	sos	Remove the SLICE code. This clearly needs a lot more thought, and we don't need this to hunt us down in 3.0-RELEASE.
# 38866	05-Sep-1998	bde	Instantiate `nfs_mount_type' in a standard file so that it is present when nfs is an LKM. Declare it in a header file. Don't forget to use it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0 will soon be the mount type number for the first vfs loaded. NetBSD uses strcmp() to avoid this ugly global.
# 38618	29-Aug-1998	bde	Oops, the previous revision unconfigured too much pre-Lite2 compatibility cruft. At least lsvfs(1) was broken.
# 38289	12-Aug-1998	bde	Don't configure compatibility code for pre-Lite2 mount() calls by default. This code should go away soon.
# 37599	12-Jul-1998	dfr	Initialise all the fields separately in vattr_null since on the alpha they are not all the same width.
# 37555	11-Jul-1998	bde	Fixed printf format errors.
# 37101	21-Jun-1998	bde	Removed unused includes.
# 36874	10-Jun-1998	julian	Replace 'sleep()' with 'tsleep()' Accidentally imported from Kirk's codebase. Pointed out by: various.
# 36862	10-Jun-1998	julian	Submitted by: Kirk McKusick <mckusick@McKusick.COM> Fix for potential hang when trying to reboot the system or to forcibly unmount a soft update enabled filesystem. FreeBSD already handled the reboot case differently, this is however a better fix.
# 36735	07-Jun-1998	dfr	This commit fixes various 64bit portability problems required for FreeBSD/alpha. The most significant item is to change the command argument to ioctl functions from int to u_long. This change brings us in line with various other BSD versions. Driver writers may like to use (__FreeBSD_version == 300003) to detect this change. The prototype FreeBSD/alpha machdep will follow in a couple of days time.
# 36126	17-May-1998	tegge	Supply the correct process argument to dounmount when possible.
# 35319	19-Apr-1998	julian	Add changes and code to implement a functional DEVFS. This code will be turned on with the TWO options DEVFS and SLICE. (see LINT) Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will delineate these changes. /dev will be automatically mounted by init (thanks phk) on bootup. See /sys/dev/slice/slice.4 for more info. All code should act the same without these options enabled. Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others This code does not support the following: bad144 handling. Persistence. (My head is still hurting from the last time we discussed this) ATAPI floppies are not handled by the SLICE code yet. When this code is running, all major numbers are arbitrary and COULD be dynamically assigned. (this is not done, for POLA only) Minor numbers for disk slices ARE arbitrary and dynamically assigned.
# 35264	18-Apr-1998	peter	In vfs_msync(), test to see if the vnode being examined is "interesting" (ie: it has a vm_object attached and is marked as OBJ_MIGHTBEDIRTY) before attempting to lock it. This should reduce the cpu hit that is incurred when doing a sync(2) and when the syncer process is doing the 30-second writeback of dirty mmap() data to disk. Skip this speedup if we are doing an unmount() to be sure to get everything - we can afford to occasionally miss a msync while the system is running, but not at unmount. I'm not sure about the VXLOCK and MNT_WAIT case, it seems a bit odd to skip doing a page_clean at unmount time just because a vnode is VXLOCKed, but that's what was being done before...
# 35220	16-Apr-1998	peter	When the softdep conversion took place, the periodic vfs_msync() from update got lost. This is responsible for ensuring that dirty mmap() pages get periodically written to disk. Without it, long time mmap's might not have their dirty pages written out at all if the system crashes or isn't cleanly shut down. This could be nasty if you've got a long-running writing via mmap(), dirty pages used to get written to disk within 30 seconds or so.
# 35214	15-Apr-1998	tegge	Unlock mountlist_slock if the mount point was busy (unmount in progress) during the attempt at lazy fsync.
# 34961	30-Mar-1998	phk	Eradicate the variable "time" from the kernel, using various measures. "time" wasn't an atomic variable, so splfoo() protection was needed around any access to it, unless you just wanted the seconds part. Most uses of time.tv_sec now use the new variable time_second instead. gettime() changed to getmicrotime(). Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it). A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random. Add a new nfs_curusec() function. Mark a couple of bogosities involving the now disappeared time variable. Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passed &time as args. Change profiling in ncr.c to use ticks instead of time. Resolution is the same. Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences. Reviewed by: bde
# 34928	28-Mar-1998	bde	Removed unused #includes.
# 34926	28-Mar-1998	bde	Don't depend on <sys/mount.h> including <sys/socket.h>.
# 34694	19-Mar-1998	dyson	In kern_physio.c fix tsleep priority messup. In vfs_bio.c, remove b_generation count usage, remove redundant reassignbuf, remove redundant spl(s), manage page PG_ZERO flags more correctly, utilize an invalid value for b_offset until it is properly initialized. Add asserts for #ifdef DIAGNOSTIC, when b_offset is improperly used. When a process is not performing I/O, and just waiting on a buffer generally, make the sleep priority low. Only check page validity in getblk for B_VMIO buffers. In vfs_cluster, add b_offset asserts, correct pointer calculation for clustered reads. Improve readability of certain parts of the code. Remove redundant spl(s). In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin <gallatin@cs.duke.edu>). More vtruncbuf problems fixed.
# 34690	19-Mar-1998	dyson	Fix an embarrassing problem in vtruncbuf.
# 34639	17-Mar-1998	dyson	Correct a severely evil bug in the vtruncbuf code. It didn't cause me any problems until after the previous commit. This problem then caused a severe case of creeping crud on my diskdrive, and hosed my system so bad, that I needed to do a complete reinstall. Sorry!!! I assume that others have manifest this bug.
# 34612	16-Mar-1998	dyson	Allow vfs_ioopt to be enabled with a (temporary) config option.
# 34611	16-Mar-1998	dyson	Some VM improvements, including elimination of alot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!! pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.) pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances. vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.) vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust. vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true. vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.) vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities. vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.) vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. 
The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliability and performance issue.) 10) Modify FFS to use vtruncbuf. vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.) vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, support page invalidation change the object generation count so that we handle generation counts a little more robustly. vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.) vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
# 34577	14-Mar-1998	dyson	Disable the vfs.ioopt option for now, so that we don't get gratuitous bug reports. I might not be able to fix the problems before 3.0, due to other, more important things.
# 34568	14-Mar-1998	tegge	Don't misuse vnode interlocks in routines that can be called from interrupts. PR: 5893
# 34266	08-Mar-1998	julian	Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree
# 34206	07-Mar-1998	dyson	This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifest themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistancy, and as a side-effect broke the vfs.ioopt code. This code might have been committed seperately, but almost everything is interrelated. 1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitious buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups assocated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistancy problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify it's status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and it's descendents to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
# 33967	01-Mar-1998	dyson	Change vfs.ioopt default back to '0'.
# 33936	01-Mar-1998	dyson	1) Use a more consistent page wait methodology. 2) Do not unnecessarily force page blocking when paging pages out. 3) Further improve swap pager performance and correctness, including fixing the paging in progress deadlock (except in severe I/O error conditions.) 4) Enable vfs_ioopt=1 as a default. 5) Fix and enable the page prezeroing in SMP mode. All in all, SMP systems especially should show a significant improvement in "snappyness."
# 33755	23-Feb-1998	dyson	Clean-up the vget mechanism by permanently attaching VM objects to vnodes, therefore vget doesn't need to do so anymore. Other minor improvements include the temp free vnode queue obeying the VAGE flag and a printf that warns of to-be-removed code being executed.
# 33205	10-Feb-1998	kato	Fixed vnode interlock handling. Reviewed by: Bruce Evans <bde@zeta.org.au> Tor Egge <Tor.Egge@idi.ntnu.no>
# 33181	09-Feb-1998	eivind	Staticize.
# 33152	07-Feb-1998	kato	When the vp is locked, vget() calls vfs_object_create() with waslocked = TRUE. This change may fix lockmgr panic in umapfs/nullfs. PR: 5634 Reviewed by: "John S. Dyson" <toor@dyson.iquest.net> Suggested by: Bruce Evans <bde@zeta.org.au>
# 33134	06-Feb-1998	eivind	Back out DIAGNOSTIC changes.
# 33109	05-Feb-1998	dyson	1) Start using a cleaner and more consistent page allocator instead of the various ad-hoc schemes. 2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup. 3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some processor errata, and to minimize redundant processor updating of page tables. 4) Modify pmap_protect so that it can only remove permissions (as it originally supported.) The additional capability is not needed. 5) Streamline read-only to read-write page mappings. 6) For pmap_copy_page, don't enable write mapping for source page. 7) Correct and clean-up pmap_incore. 8) Cluster initial kern_exec pagein. 9) Removal of some minor lint from kern_malloc. 10) Correct some ioopt code. 11) Remove some dead code from the MI swapout routine. 12) Correct vm_object_deallocate (to remove backing_object ref.) 13) Fix dead object handling, that had problems under heavy memory load. 14) Add minor vm_page_lookup improvements. 15) Some pages are not in objects, and make sure that the vm_page.c can properly support such pages. 16) Add some more page deficit handling. 17) Some minor code readability improvements.
# 33108	04-Feb-1998	eivind	Turn DIAGNOSTIC into a new-style option.
# 32910	31-Jan-1998	tegge	Update freevnodes when adding a vnode to the head of the free list.
# 32724	24-Jan-1998	dyson	Add better support for larger I/O clusters, including larger physical I/O. The support is not mature yet, and some of the underlying implementation needs help. However, support does exist for IDE devices now.
# 32702	22-Jan-1998	dyson	VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneeded collapse operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
# 32585	17-Jan-1998	dyson	Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management. The system should be much more stable, but all subtle bugs aren't fixed yet.
# 32456	12-Jan-1998	dyson	Fix another vnode leak.
# 32454	12-Jan-1998	dyson	Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock. PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.
# 32320	07-Jan-1998	dyson	Disable io optimizations again, minor bug found, and will be fixed in a few days.
# 32286	06-Jan-1998	dyson	Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does. When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex. Vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore. A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes. Please be careful with your system while running this code, but I would greatly appreciate feedback as soon as reasonably possible.
# 32094	29-Dec-1997	dyson	Add the vnode interlock back around vref.
# 32072	29-Dec-1997	dyson	Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem with the object cache removal.
# 32071	29-Dec-1997	dyson	Lots of improvements, including restructuring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include: 1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync. Be gentle, and please give me feedback asap.
# 31853	19-Dec-1997	dyson	Some performance improvements, and code cleanups (including changing our expensive OFF_TO_IDX to btoc whenever possible.)
# 31727	15-Dec-1997	wollman	Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL if one of the new poll types is requested; hopefully this will not break any existing code. (This is done so that programs have a dependable way of determining whether a filesystem supports the extended poll types or not.) The new poll types added are: POLLWRITE - file contents may have been modified; POLLNLINK - file was linked, unlinked, or renamed; POLLATTRIB - file's attributes may have been changed; POLLEXTEND - file was extended. Note that the internal operation of poll() means that it is impossible for two processes to reliably poll for the same event (this could be fixed but may not be worth it), so it is not possible to rewrite `tail -f' to use poll at this time.
# 31352	22-Nov-1997	bde	Staticized.
# 31132	12-Nov-1997	julian	Reviewed by: various. Ever since I first saw the way the mount flags were used I've hated the fact that modes, and events, internal and exported, and short-term and long term flags are all thrown together. Finally it's annoyed me enough.. This patch to the entire FreeBSD tree adds a second mount flag word to the mount struct. It is not exported to userspace. I have moved some of the non exported flags over to this word. This means that we now have 8 free bits in the mount flags. There are another two that might well move over, but which I'm not sure about. The only user visible change would have been in pstat -v, except that davidg has disabled it anyhow. I'd still like to move the state flags and the 'command' flags apart from each other.. e.g. MNT_FORCE really doesn't have the same semantics as MNT_RDONLY, but that's left for another day.
# 31016	07-Nov-1997	phk	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused
# 30743	26-Oct-1997	phk	VFS interior redecoration. Rename vn_default_error to vop_defaultop all over the place. Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite. Use vop_null instead of nullop. Move vop_nopoll from vfs_subr.c to vfs_default.c Move vop_sharedlock from vfs_subr.c to vfs_default.c Move vop_nolock from vfs_subr.c to vfs_default.c Move vop_nounlock from vfs_subr.c to vfs_default.c Move vop_noislocked from vfs_subr.c to vfs_default.c Use vop_ebadf instead of *_ebadf. Add vop_defaultop for getpages on master vnode in MFS.
# 30354	12-Oct-1997	phk	Last major round (Unless Bruce thinks of something :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde
# 30309	11-Oct-1997	phk	Distribute and staticize a lot of the malloc M_* types. Substantial input from: bde
# 30293	11-Oct-1997	phk	Dike out a weird warning.
# 29869	26-Sep-1997	phk	I lost a bit of my change in the last commit, this is more like it. Noticed by: bde
# 29853	25-Sep-1997	phk	Reduce the target number of vnodes on the freelist from desiredvnodes (usually a couple of thousand) to 25. The measured impact on cache-hits doesn't justify spending memory this way: Target number of free vnodes versus namecache hit rate in % during a make world: 10: 98.5316; 200: 98.5479; 500: 98.5546; 1000: 98.5709; 3000: 98.6006; 4000: 98.6126
# 29788	24-Sep-1997	phk	A couple of handles to tweak, more statistics.
# 29506	16-Sep-1997	bde	Fixed gratuitous ANSIisms.
# 29358	14-Sep-1997	peter	Provide a 'return true' poll vnode op rather than duplicating the 'do nothing' case all over the various filesystems.
# 29323	13-Sep-1997	peter	print correct function name in a panic (vop_nolock -> vop_sharedlock)
# 29208	07-Sep-1997	bde	Removed yet more vestiges of config-time swap configuration and/or cleaned up nearby cruft.
# 29203	07-Sep-1997	bde	Removed vestiges of config-time "argument processing" configuration.
# 29076	03-Sep-1997	phk	Hmm, this is hopefully better.
# 29070	03-Sep-1997	phk	Revert the v_usecount handling in relation to VOP_INACTIVE.
# 29041	02-Sep-1997	bde	Removed unused #includes.
# 28954	31-Aug-1997	phk	Change the 0xdeadb hack to a flag called VDOOMED. Introduce VFREE which indicates that vnode is on freelist. Rename vholdrele() to vdrop(). Create vfree() and vbusy() to add/delete vnode from freelist. Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0) vnodes off the freelist. Generalize vhold()/v_holdcnt to mean "do not recycle". Fix reassignbuf()s lack of use of vhold(). Use vhold() instead of checking v_cache_src list. Remove vtouch(), the vnodes are always vget'ed soon enough after for it to have any measurable effect. Add sysctl debug.freevnodes to keep track of things. Move cache_purge() up in getnewvnodes to avoid race. Decrement v_usecount after VOP_INACTIVE(), put a vhold() on it during VOP_INACTIVE() Unmacroize vhold()/vdrop() Print out VDOOMED and VFREE flags (XXX: should use %b) Reviewed by: dyson
# 28795	26-Aug-1997	bde	Restored rev.1.92 which was clobbered by the previous commit.
# 28774	26-Aug-1997	dyson	Back out some incorrect changes that were worse than the original bug.
# 28558	22-Aug-1997	dyson	This is a trial improvement for the vnode reference count while on the vnode free list problem. Also, the vnode age flag is no longer used by the vnode pager. (It is actually incorrect to use then.) Constructive feedback welcome -- just be kind.
# 28551	21-Aug-1997	bde	#include <machine/limits.h> explicitly in the few places that it is required.
# 28270	16-Aug-1997	wollman	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to now malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 27892	04-Aug-1997	dyson	Fix a problem with the vfs vnode caching that it doesn't grow quickly enough and can cause some strange performance problems. Specifically, at or near startup time is when the problem is worst. To reproduce the problem, run "lat_syscall stat" from the alpha lmbench code right after bootup. A positive side effect of this mod is that the name cache can be set to grow again by sysctl. A noticeable positive performance impact is realized due to a larger namecache being available as needed (or tuned.)
# 27473	17-Jul-1997	dfr	Merge WebNFS support from NetBSD Obtained from: NetBSD
# 26780	22-Jun-1997	dyson	Remove a window during running down a file vnode. Also, the OBJ_DEAD flag wasn't being respected during vref(), et. al. Note that this isn't the eventual fix for the locking problem. Fine grained SMP in the VM and VFS code will require (lots) more work.
# 26533	10-Jun-1997	dg	Disabled the kern.vnode sysctl variable. It's causing system crashes on large systems and needs to be rethought or removed wholesale.
# 25509	06-May-1997	phk	Fix a race condition that did, after all, exist. Reviewed by: phk Submitted by: dfr
# 25453	04-May-1997	phk	1. Add a {pointer, v_id} pair to the vnode to store the reference to the ".." vnode. This is cheaper storagewise than keeping it in the namecache, and it makes more sense since it's a 1:1 mapping. 2. Also handle the case of "." more intelligently rather than stuff the namecache with pointless entries. 3. Add two lists to the vnode and hang namecache entries which go from or to this vnode. When cleaning a vnode, delete all namecache entries it invalidates. 4. Never reuse namecache entries, malloc new ones when we need it, free old ones when they die. No longer a hard limit on how many we can have. 5. Remove the upper limit on namelength of namecache entries. 6. Make a global list for negative namecache entries, limit their number to a sysctl'able (debug.ncnegfactor) fraction of the total namecache. Currently the default fraction is 1/16th. (Suggestions for better default wanted!) 7. Assign v_id correctly in the face of 32bit rollover. 8. Remove the LRU list for namecache entries, not needed. Remove the #ifdef NCH_STATISTICS stuff, it's not needed either. 9. Use the vnode freelist as a true LRU list, also for namecache accesses. 10. Reuse vnodes more aggressively but also more selectively, if we can't reuse, malloc a new one. There is no longer a hard limit on their number, they grow to the point where we don't reuse potentially usable vnodes. A vnode will not get recycled if still has pages in core or if it is the source of namecache entries (Yes, this does indeed work :-) "." and ".." are not namecache entries any longer...) 11. Do not overload the v_id field in namecache entries with whiteout information, use a char sized flags field instead, so we can get rid of the vpid and v_id fields from the namecache struct. Since we're linked to the vnodes and purged when they're cleaned, we don't have to check the v_id any more. 12. NFS knew about the limitation on name length in the namecache, it shouldn't and doesn't now.
Bugs: The namecache statistics no longer include the hits for ".." and ".". Performance impact: Generally in the +/- 0.5% for "normal" workstations, but I hope this will allow the system to be selftuning over a bigger range of "special" applications. The case where RAM is available but unused for cache because we don't have any vnodes should be gone. Future work: Straighten out the namecache statistics. "desiredvnodes" is still used to (bogusly ?) size hash tables in the filesystems. I have still to find a way to safely free unused vnodes back so their number can shrink when not needed. There are a few uses of the v_id field left in the filesystems, scheduled for demolition at a later time. Maybe a one slot cache for unused namecache entries should be implemented to decrease the malloc/free frequency.
# 25294	30-Apr-1997	dyson	Staticize an unnecessarily global function: vputrele. Submitted by: Michael Hancock <michaelh@cet.co.jp>
# 25129	25-Apr-1997	peter	copyin the export network mask to the correct variable. Submitted by: Mike Hibler <mike@marker.cs.utah.edu>, PR#3380
# 24624	04-Apr-1997	dfr	Add a function vop_sharedlock which is a copy of vop_nolock without the implementation #ifdef'ed out. This can be used for now by NFS. As soon as all the other filesystems' locking is fixed, this can go away. Print the vnode address in vprint for easier debugging.
# 24487	01-Apr-1997	bde	Use OID_AUTO instead of magic number for the Lite2 sysctl debug.busyprt. Removed declaration of vfs_unmountroot() again. Staticized vgonel().
# 23389	05-Mar-1997	dg	Fixed splbio problems in vinvalbuf. Closes PR#2875, although fixed differently by me.
# 23382	04-Mar-1997	bde	Attach vfs_sysctl() one level lower so that only the levels below VFS_GENERIC aren't done in the FreeBSD way. The previous commit broke the nfs sysctls.
# 23333	03-Mar-1997	bde	Merged Lite2's vfs_sysctl(). It doesn't fit very well into FreeBSD's (phk's) sysctl framework, and I needed special code to disambiguate the VFS_GENERIC node from the VFS_VFSCONF leaf, so I only converted the leaves to the FreeBSD framework. The error handling isn't quite right. CSRG's sysctls seem to return ENOTDIR too much and FreeBSD's sysctls don't agree with the man page.
# 23289	02-Mar-1997	bde	Restored some pre-Lite2-merge source-level compatibility to the mount() and getvfsbyname() interfaces. The new interfaces are now hidden from applications unless _NEW_VFSCONF is defined. The new vfsconf interfaces don't work yet.
# 23254	02-Mar-1997	bde	Moved vfs sysctls to where Lite2 put them. No code changes yet.
# 23159	27-Feb-1997	bde	Fixed Lite2 merge of spechash simplelocking. It was misplaced in checkalias() and missing in vfinddev() and vcount().
# 23149	27-Feb-1997	dyson	Fix the previous simple_lock fix breakage in the combined vput/vrele routine. Fix a panic message. Fix the vop_nounlock routine so that "special" filesystems that use it work correctly.
# 23145	27-Feb-1997	dyson	Fix the simple_lock problem with the physical I/O buffer code, and also fix the missing simple_unlock in vrele, and improve vrele/vput by merging them into one routine. BDE pointed these problems out.
# 23135	26-Feb-1997	bde	Fixed unmounting of the root fs. vfs_unmountroot() wasn't fully updated to do Lite2 locking and vfs_unmountall() wasn't as simple as the Lite2 version.
# 23118	25-Feb-1997	bde	Merged some missing locking from Lite2: - getnewvnode() and vref() were missing one simple_unlock() each. - the Lite2 locking changes weren't merged at all in printlockedvnodes() or sysctl_vnode(). Merging these undid some KNF style regressions.
# 22975	22-Feb-1997	peter	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 22521	10-Feb-1997	dyson	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# 21770	16-Jan-1997	bde	Removed option EXTRAVNODES. All versions of FreeBSD-2.x have a sysctl variable `kern.maxvnodes' which gives much better control over vnode allocation than EXTRAVNODES (except in -current between 1995/10/28 and 1996/11/12, kern.maxvnodes was read-only and thus useless).
# 21673	14-Jan-1997	jkh	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 21002	29-Dec-1996	dyson	This commit is the embodiment of some VFS read clustering improvements. Firstly, now our read-ahead clustering is on a file descriptor basis and not on a per-vnode basis. This will allow multiple processes reading the same file to take advantage of read-ahead clustering. Secondly, there previously was a problem with large reads still using the ramp-up algorithm. Of course, that was bogus, and now we read the entire "chunk" off of the disk in one operation. The read-ahead clustering algorithm should use less CPU than the previous also (I hope :-)). NOTE: THAT LKMS MUST BE REBUILT!!!
# 19667	12-Nov-1996	bde	Restored writability of kern.maxvnodes. It was broken a year ago in rev.1.29 of kern_sysctl.c. Should be in 2.2.
# 19229	28-Oct-1996	phk	init_main.c: pass -d to init if DEVFS_ROOT kern_conf.c: gd driver is a disk. vfs_subr.c: include opt_devfs.h
# 18996	17-Oct-1996	jkh	I'm not sure why, but Netcon's TFS filesystem code doesn't want to add free vnodes back to the freelist. They must do their own vnode management. Anyway, this change is only activated with their filesystem and doesn't affect anyone else. Whoops, forgot the submitted-by lines in my previous commits too.. :-( Submitted-By: Tony Ardolino <tony@netcon.com>
# 18973	17-Oct-1996	dyson	Clean up the rundown of the object backing a vnode. This should fix NFS problems associated with forcible dismounts.
# 18527	28-Sep-1996	dyson	Correct vget by removing a window where a vnode can potentially go away.
# 18397	19-Sep-1996	nate	In sys/time.h, struct timespec is defined as: /* Structure defined by POSIX.4 to be like a timeval. */ struct timespec { time_t ts_sec; /* seconds */ long ts_nsec; /* and nanoseconds */ }; The correct names of the fields are tv_sec and tv_nsec. Reminded by: James Drobina <jdrobina@infinet.com>
# 17761	21-Aug-1996	dyson	Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistent and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices is simpler and more correct. This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc. Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility. Reviewed by: davidg
# 17605	15-Aug-1996	dyson	Certain vnode buffer list operations were not being spl protected, and they needed to be. Brelse for example can be called at interrupt level, and the buffer list operations were not being protected from it.
# 17349	30-Jul-1996	bde	Only use the special bdevvp() for DEVFS if DEVFS_ROOT is defined. This makes option DEVFS safe to use again (although mounting devfs is unsafe).
# 17272	24-Jul-1996	phk	DEVFS needs a special bdevvp().
# 17122	12-Jul-1996	bde	Staticized a few variables. Fixed warnings about unused variables.
# 16025	31-May-1996	peter	Add an option "EXTRA_VNODES" to cause an extra number of vnode structures to be allocated at boot time. This is an expensive option, as they consume physical ram and are not pageable etc. In certain situations, this kind of option is quite useful, especially for news servers that access a large number of directories at random and torture the name cache. Defining 5000 or 10000 extra vnodes should cut down the amount of vnode recycling somewhat, which should allow better name and directory caching etc. This is a "your mileage may vary" option, with no real indication of what works best for your machine except trial and error. Too many will cost you ram that you could otherwise use for disk buffers etc. This is based on something John Dyson mentioned to me a while ago.
# 14425	09-Mar-1996	dyson	Put the "free vnode isn't" check back in the right place.
# 13490	19-Jan-1996	dyson	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do a lot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30 seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
# 13228	04-Jan-1996	wollman	Convert DDB to new-style option.
# 13168	02-Jan-1996	dg	Moved the #ifdef DIAGNOSTIC in vrele() so that the check for negative v_usecount is always performed and only the call to vprint is conditional.
# 12913	17-Dec-1995	phk	Staticize. Unstaticize a function in scsi/scsi_base that was used, with an undocumented option. My last count on the LINT kernel shows: Total symbols: 3647; unref symbols: 463; undef symbols: 4; 1 ref symbols: 1751; 2 ref symbols: 485. Approaching the pain threshold now.
# 12767	11-Dec-1995	dyson	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
# 12662	07-Dec-1995	dg	Untangled the vm.h include file spaghetti.
# 12650	06-Dec-1995	phk	A couple of minor tweaks to the sysctl stuff.
# 12577	02-Dec-1995	bde	Completed function declarations and/or added prototypes.
# 12519	29-Nov-1995	phk	A test was backwards. Noticed by: Cheng, Hsiao-Yang <sycheng@cis.ufl.edu>
# 12429	20-Nov-1995	phk	Mega commit for sysctl. Convert the remaining sysctl stuff to the new way of doing things. The devconf stuff is the reason for the large number of files. Cleaned up some compiler warnings while I was there.
# 12324	16-Nov-1995	bde	Fixed support for DIAGNOSTIC option. SYSCTL_INT() depends on kernel.h.
# 12283	14-Nov-1995	phk	Change some of the debug sysctl vars. The semantics of these will change.
# 12199	11-Nov-1995	bde	Fixed type of vfs_free_netcred(). Removed redundant declaration of insmntque().
# 12158	09-Nov-1995	bde	Introduced a type `vop_t' for vnode operation functions and used it 1138 times (:-() in casts and a few more times in declarations. This change is null for the i386. The type has to be `typedef int vop_t(void *)' and not `typedef int vop_t()' because `gcc -Wstrict-prototypes' warns about the latter. Since vnode op functions are called with args of different (struct pointer) types, neither of these function types is any use for type checking of the arg, so it would be preferable not to use the complete function type, especially since using the complete type requires adding 1138 casts to avoid compiler warnings and another 40+ casts to reverse the function pointer conversions before calling the functions.
# 12136	07-Nov-1995	dyson	This is a modification missed by me in the msync fixes a few days ago.
# 11852	28-Oct-1995	bde	Call vfs_unbusy() before error returns from sysctl_vnode(). This fixes PR 795. Set the size before one error return from sysctl_vnode() the same as before the other. The caller might want to know about the amount successfully read although the current caller doesn't.
# 10275	25-Aug-1995	bde	Don't compile the diagnostic functions vhold() and holdrele() unless DIAGNOSTIC is defined.
# 10027	11-Aug-1995	dg	Converted mountlist to a CIRCLEQ. Partially obtained from: 4.4BSD-Lite2
# 9507	13-Jul-1995	dg	NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!! Much needed overhaul of the VM system. Included in this first round of changes: 1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers". 2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has essentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items. 3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed. 4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug. 5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. 
We now use tsleep and wakeup directly in the lock routines, for instance. 6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain. 7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance. 8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed. 9) Some almost useless debugging code removed. 10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure essentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology. 11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended. 12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course). 13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non-standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE. 
14) (dyson) Sharing maps have been removed. Its marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintainability. (As were most all of these changes) TODO: 1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size. 2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness. 3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind. 4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems. 5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
# 9436	08-Jul-1995	dg	Improve negative usecount diagnostic a little.
# 9356	28-Jun-1995	dg	1) Converted v_vmdata to v_object. 2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs after vnode_pager_alloc() calls - the object is already guaranteed to be persistent. 3) Removed some gratuitous casts.
# 9340	27-Jun-1995	bde	Pass the correct nonblocking flag to VOP_CLOSE() in vclean(). VOP_CLOSE() takes `F' (file) flags, not `IO' flags. At least that's what close() passes. I previously fixed ttylclose() to check FNONBLOCK instead of IO_NDELAY. This broke the call from vclean() and cleaning of ptys sometimes deadlocked.
# 8692	21-May-1995	dg	Changes to fix the following bugs: 1) Files weren't properly synced on filesystems other than UFS. In some cases, this led to lost data. Most likely would be noticed on NFS. The fix is to make the VM page sync/object_clean general rather than in each filesystem. 2) Mixing regular and mmapped file I/O on NFS was very broken. It caused chunks of files to end up as zeroes rather than the intended contents. The fix was to fix several race conditions and to kludge up the "b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention to page modifications that occurred via the mmapping. Reviewed by: David Greenman Submitted by: John Dyson
# 8465	12-May-1995	dg	Increased ratio of allowed vnodes on freelist to 1/4th of the total. This is more representative of worst case situations of 4 files/directory. (If that last sentence doesn't make any sense, I'm not surprised. It's rather complicated how this all fits together....). This should fix a problem that Ed Hudson has been complaining about where directories with lots of symlinks could cause excessive disk I/O.
# 7877	16-Apr-1995	dg	Changed #ifdef around printlockedvnodes() from DEBUG to DDB.
# 7694	09-Apr-1995	dg	Changes from John Dyson and myself: Fixed remaining known bugs in the buffer IO and VM system. vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR). (various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf(). (various FS sync) Sync out modified vnode_pager backed pages. ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks. vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order. vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previously not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES(). vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted. vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
# 7205	21-Mar-1995	dg	Fixed vinvalbuf() to work like NFS wants it to. The previous code wouldn't flush pages in the vm object if V_SAVE was true.
# 7186	20-Mar-1995	dg	Don't gain/lose a reference to the object when yanking its pages in vinvalbuf()...it will cause vnode locking problems in vm_object_terminate, and isn't necessary anyway.
# 7181	20-Mar-1995	dg	Don't attempt to sync pages in the V_SAVE case of vinvalbuf; doing so can lead to a deadlock. Just let the VM system deal with it.
# 7090	16-Mar-1995	bde	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 7010	11-Mar-1995	dg	Added a comment.
# 6991	10-Mar-1995	dg	Reorganized an if() expression for efficiency.
# 6969	09-Mar-1995	phk	Clean up and improve the namecache. 1. We always keep one 16th of the vnodes on the freelist, so that the namecache doesn't get trashed. It used to be that it wasn't a problem, but the only vnodes getting released these days are directories and things which get forced out of the VM/cache. The latter is not numerous enough to keep the pool of vnodes needed for the namecache sufficiently big. 2. Purge invalid entries in the namecache as soon as we notice them. This avoids a stale entry pushing out a valid entry on the LRU list. 3. Speed up the lookup in the namecache by avoiding a special case branch. 4. Make the cache purge routines do the thing they're supposed to, and in a decently efficient manner. 5. Make the size of the namecache follow the number of vnodes, so that we can always point to all the vnodes we have in core. 6. Readability has gone way up. 7. Added a "options NCH_STATISTICS" feature that will gather more detailed statistics on the performance of the namecache. Reviewed by: davidg (cvs is dumping core on me :-( )
# 6945	07-Mar-1995	dg	Put VAGE vnodes at the head of the free list.
# 6762	27-Feb-1995	dg	Backed out previous change. I forgot (for about the fourth time) that v_rdev is a #define which is dereferenced through v_specinfo->si_rdev, and that isn't initialized until later in checkalias().
# 6758	27-Feb-1995	dg	Initialize v_rdev in getnewvnode() - it appears that some filesystems may not properly initialize this field in all cases, and this would result in very anti-social behavior (overwriting on some other random device/location). Submitted by: John Dyson
# 6621	22-Feb-1995	dg	vfs_cluster.c: Various more tweaks from John Dyson to improve read ahead calculations. vfs_subr.c: Only wakeup if numoutput is 0 in vwakeup(). Submitted by: John Dyson
# 5464	10-Jan-1995	dg	Fixed some formatting weirdness that I overlooked in the previous commit.
# 5455	09-Jan-1995	dg	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a separate submap for temporary exec string space and another one to contain process upages. 
This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
# 5201	23-Dec-1994	dg	Protect vnode buffer chain manipulation with splbio to prevent list corruption.
# 3396	06-Oct-1994	dg	Use tsleep() rather than sleep so that 'ps' is more informative about the wait.
# 3374	05-Oct-1994	dg	Stuff object into v_vmdata rather than pager. Not important which at the moment, but will be in the future. Other changes mostly cosmetic, but are made for future VMIO considerations. Submitted by: John Dyson
# 3308	02-Oct-1994	phk	All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.
# 3098	25-Sep-1994	phk	While in the real world, I had a bad case of being swapped out for a lot of cycles. While waiting there I added a lot of the extra ()'s I have, (I have never used LISP to any extent). So I compiled the kernel with -Wall and shut up a lot of "suggest you add ()'s", removed a bunch of unused var's and added a couple of declarations here and there. Having a lap-top is highly recommended. My kernel still runs, yell at me if your kernel breaks.
# 2384	29-Aug-1994	dg	"bogus" fixes from 1.1.5 to work around some cache coherency problems.
# 2250	24-Aug-1994	dg	Initialized v_writecount.
# 2220	22-Aug-1994	dg	print "BUSY" instead of error number if filesystem was busy during vfs_unmountall() - this is the most common case. If it was a different error, then print the error number.
# 2152	20-Aug-1994	dg	Implemented filesystem clean bit via: machdep.c: Changed printf's a little and call vfs_unmountall() if the sync was successful. cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c: Allow dismount of root FS. It is now disallowed at a higher level. vfs_conf.c: Removed unused rootfs global. vfs_subr.c: Added new routines vfs_unmountall and vfs_unmountroot. Filesystems are now dismounted if the machine is properly rebooted. ffs_vfsops.c: Toggle clean bit at the appropriate places. Print warning if an unclean FS is mounted. ffs_vfsops.c, lfs_vfsops.c: Fix bug in selecting proper flags for VOP_CLOSE(). vfs_syscalls.c: Disallow dismounting root FS via umount syscall.
# 2112	18-Aug-1994	wollman	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 1817	02-Aug-1994	dg	Added $Id$
# 1549	25-May-1994	rgrimes	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# 1542	24-May-1994	rgrimes	This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
# 1541	24-May-1994	rgrimes	BSD 4.4 Lite Kernel Sources