Cross Reference: /freebsd-10.0-release/sys/ufs/

History log of /freebsd-10.0-release/sys/ufs/
Revision	Date	Author	Comments (<<< Hide modified files) (Show modified files >>>)
259065	07-Dec-2013	gjb	- Copy stable/10 (r259064) to releng/10.0 as part of the 10.0-RELEASE cycle. - Update __FreeBSD_version [1] - Set branch name to -RC1 [1] 10.0-CURRENT __FreeBSD_version value ended at '55', so start releng/10.0 at '100' so the branch is started with a value ending in zero. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation /freebsd-10.0-release/sys/conf/newvers.sh /freebsd-10.0-release/sys/sys/param.h
256281	10-Oct-2013	gjb	Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
255219	05-Sep-2013	pjd	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation
254996	28-Aug-2013	mckusick	In looking at block layouts as part of fixing filesystem block allocations under low free-space conditions (-r254995), determine that old block-preference search order used before -r249782 worked a bit better. This change reverts to that block-preference search order. MFC after: 2 weeks
254995	28-Aug-2013	mckusick	A performance problem was reported in PR kern/181226: I have 25TB Dell PERC 6 RAID5 array. When it becomes almost full (10-20GB free), processes which write data to it start eating 100% CPU and write speed drops below 1MB/sec (normally to gives 400MB/sec). The revision at which it first became apparent was http://svnweb.freebsd.org/changeset/base/249782. The offending change reserved an area in each cylinder group to store metadata. The new algorithm attempts to save this area for metadata and allows its use for non-metadata only after all the data areas have been exhausted. The size of the reserved area defaults to half of minfree, so the filesystem reports full before the data area can completely fill. However, in this report, the filesystem has had minfree reduced to 1% thus forcing the metadata area to be used for data. As the filesystem approached full, it had only metadata areas left to allocate. The result was that every block allocation had to scan summary data for 30,000 cylinder groups before falling back to searching up to 30,000 metadata areas. The fix is to give up on saving the metadata areas once the free space reserve drops below 2%. The effect of this change is to use the old algorithm of just accepting the first available block that we find. Since most filesystems use the default 5% minfree, this will have no effect on their operation. For those that want to push to the limit, they will get their crappy block placements quickly. Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko PR: kern/181226 MFC after: 2 weeks
254986	28-Aug-2013	ivoras	Take a very small step toward the Century of the Anchovy by increasing the time dirhash entries stay in memory before being considered for eviction to 1 minute.
254627	21-Aug-2013	ken	Expand the use of stat(2) flags to allow storing some Windows/DOS and CIFS file attributes as BSD stat(2) flags. This work is intended to be compatible with ZFS, the Solaris CIFS server's interaction with ZFS, somewhat compatible with MacOS X, and of course compatible with Windows. The Windows attributes that are implemented were chosen based on the attributes that ZFS already supports. The summary of the flags is as follows: UF_SYSTEM: Command line name: "system" or "usystem" ZFS name: XAT_SYSTEM, ZFS_SYSTEM Windows: FILE_ATTRIBUTE_SYSTEM This flag means that the file is used by the operating system. FreeBSD does not enforce any special handling when this flag is set. UF_SPARSE: Command line name: "sparse" or "usparse" ZFS name: XAT_SPARSE, ZFS_SPARSE Windows: FILE_ATTRIBUTE_SPARSE_FILE This flag means that the file is sparse. Although ZFS may modify this in some situations, there is not generally any special handling for this flag. UF_OFFLINE: Command line name: "offline" or "uoffline" ZFS name: XAT_OFFLINE, ZFS_OFFLINE Windows: FILE_ATTRIBUTE_OFFLINE This flag means that the file has been moved to offline storage. FreeBSD does not have any special handling for this flag. UF_REPARSE: Command line name: "reparse" or "ureparse" ZFS name: XAT_REPARSE, ZFS_REPARSE Windows: FILE_ATTRIBUTE_REPARSE_POINT This flag means that the file is a Windows reparse point. ZFS has special handling code for reparse points, but we don't currently have the other supporting infrastructure for them. UF_HIDDEN: Command line name: "hidden" or "uhidden" ZFS name: XAT_HIDDEN, ZFS_HIDDEN Windows: FILE_ATTRIBUTE_HIDDEN This flag means that the file may be excluded from a directory listing if the application honors it. FreeBSD has no special handling for this flag. The name and bit definition for UF_HIDDEN are identical to the definition in MacOS X. UF_READONLY: Command line name: "urdonly", "rdonly", "readonly" ZFS name: XAT_READONLY, ZFS_READONLY Windows: FILE_ATTRIBUTE_READONLY This flag means that the file may not written or appended, but its attributes may be changed. ZFS currently enforces this flag, but Illumos developers have discussed disabling enforcement. The behavior of this flag is different than MacOS X. MacOS X uses UF_IMMUTABLE to represent the DOS readonly permission, but that flag has a stronger meaning than the semantics of DOS readonly permissions. UF_ARCHIVE: Command line name: "uarch", "uarchive" ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE Windows name: FILE_ATTRIBUTE_ARCHIVE The UF_ARCHIVED flag means that the file has changed and needs to be archived. The meaning is same as the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute. msdosfs and ZFS have special handling for this flag. i.e. they will set it when the file changes. sys/param.h: Bump __FreeBSD_version to 1000047 for the addition of new stat(2) flags. chflags.1: Document the new command line flag names (e.g. "system", "hidden") available to the user. ls.1: Reference chflags(1) for a list of file flags and their meanings. strtofflags.c: Implement the mapping between the new command line flag names and new stat(2) flags. chflags.2: Document all of the new stat(2) flags, and explain the intended behavior in a little more detail. Explain how they map to Windows file attributes. Different filesystems behave differently with respect to flags, so warn the application developer to take care when using them. zfs_vnops.c: Add support for getting and setting the UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN, UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags. All of these flags are implemented using attributes that ZFS already supports, so the on-disk format has not changed. ZFS currently doesn't allow setting the UF_REPARSE flag, and we don't really have the other infrastructure to support reparse points. msdosfs_denode.c, msdosfs_vnops.c: Add support for getting and setting UF_HIDDEN, UF_SYSTEM and UF_READONLY in MSDOSFS. It supported SF_ARCHIVED, but this has been changed to be UF_ARCHIVE, which has the same semantics as the DOS archive attribute instead of inverse semantics like SF_ARCHIVED. After discussion with Bruce Evans, change several things in the msdosfs behavior: Use UF_READONLY to indicate whether a file is writeable instead of file permissions, but don't actually enforce it. Refuse to change attributes on the root directory, because it is special in FAT filesystems, but allow most other attribute changes on directories. Don't set the archive attribute on a directory when its modification time is updated. Windows and DOS don't set the archive attribute in that scenario, so we are now bug-for-bug compatible. smbfs_node.c, smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM, UF_READONLY and UF_ARCHIVE in SMBFS. This is similar to changes that Apple has made in their version of SMBFS (as of smb-583.8, posted on opensource.apple.com), but not quite the same. We map SMB_FA_READONLY to UF_READONLY, because UF_READONLY is intended to match the semantics of the DOS readonly flag. The MacOS X code maps both UF_IMMUTABLE and SF_IMMUTABLE to SMB_FA_READONLY, but the immutable flags have stronger meaning than the DOS readonly bit. stat.h: Add definitions for UF_SYSTEM, UF_SPARSE, UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY and UF_HIDDEN. The definition of UF_HIDDEN is the same as the MacOS X definition. Add commented-out definitions of UF_COMPRESSED and UF_TRACKED. They are defined in MacOS X (as of 10.8.2), but we do not implement them (yet). ufs_vnops.c: Add support for getting and setting UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY, UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS. Alphabetize the flags that are supported. These new flags are only stored, UFS does not take any action if the flag is set. Sponsored by: Spectra Logic Reviewed by: bde (earlier version)
253998	06-Aug-2013	mckusick	This bug fix is in a code path in rename taken when there is a collision between a rename and an open system call for the same target file. Here, rename releases its vnode references, waits for the open to finish, and then restarts by reacquiring its needed vnode locks. In this case, rename was unlocking but failing to release its reference to one of its held vnodes. The effect was that even after all the actual references to the vnode had gone, the vnode still showed active references. For files that had been removed, their space was not reclaimed until the filesystem was forcibly unmounted. This bug manifested itself in the Postgres server which would leak/lose hundreds of files per day amounting to many gigabytes of disk space. This bug required shutting down Postgres, forcibly unmounting its filesystem, remounting its filesystem and restarting Postgres every few days to recover the lost space. Reported by: Dan Thomas and Palle Girgensohn Bug-fix by: kib Tested by: Dan Thomas and Palle Girgensohn MFC after: 2 weeks
253974	05-Aug-2013	mckusick	With the addition of journalled soft updates, the "newblk" structures persist much longer than previously. Historically we had at most 100 entries; now the count may reach a million. With the increased count we spent far too much time looking them up in the grossly undersized newblk hash table. Configure the newblk hash table to accurately reflect the number of entries that it must index. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
253973	05-Aug-2013	mckusick	To better understand performance problems with journalled soft updates, we need to collect the highest level of allocation for each of the different soft update dependency structures. This change collects these statistics and makes them available using `sysctl debug.softdep.highuse'. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
253341	14-Jul-2013	mckusick	Update to comments describing block allocation policy. Submitted by: Bruce Evans
253280	12-Jul-2013	kib	Only copy as much bytes as there in superblock, instead of the full block copy, when copying the superblock into the snapshot. UFS1 does not align superblock on the block boundary, and bcopy runs off the end of the buffer. Reported by: Andre Albsmeier <Andre.Albsmeier@siemens.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week
253163	10-Jul-2013	pfg	Change i_gen in UFS to an unsigned type. Missing type change from r252435. This fixes a "Stale NFS file handle" error. Reported by: Claude Bisson Tested by: Claude Bisson Pointed hat: pfg
253106	09-Jul-2013	kib	There are several code sequences like vfs_busy(mp); vfs_write_suspend(mp); which are problematic if other thread starts unmount between two calls. The unmount starts a write, while vfs_write_suspend() drain writers. On the other hand, unmount drains busy references, causing the deadlock. Add a flag argument to vfs_write_suspend and require the callers of it to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the mount path, i.e. the covered vnode is not locked. The suspension is not attempted if VS_SKIP_UNMOUNT is specified and unmount is in progress. Reported and tested by: Andreas Longwitz <longwitz@incore.de> Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
252527	02-Jul-2013	mckusick	Make better use of metadata area by avoiding using it for data blocks that no should no longer immediately follow their indirect blocks. MFC after: 2 weeks
252515	02-Jul-2013	pfg	Style fix: spaces. Cleanup the incomplete revert. Reported by: bde MFC after: 4 weeks
252484	01-Jul-2013	pfg	Change i_gen in UFS to an unsigned type. Revert the simplification of the i_gen calculation. It is still a good idea to avoid zero values and for the case of old filesystems there is probably no advantage in using the complete 32 bits anyways. Discussed with: bde MFC after: 4 weeks
252467	01-Jul-2013	pfg	Change i_gen in UFS to an unsigned type. Further simplify the i_gen calculation for older disks. Having a zero here is not really a problem and this is more similar to what is done in newfs_random(). Reported by: Xin Li MFC after: 4 weeks
252438	01-Jul-2013	gleb	Don't assume that UFS on-disk format of a directory is the same as defined by <sys/dirent.h> Always start parsing at DIRBLKSIZ aligned offset, skip first entries if uio_offset is not DIRBLKSIZ aligned. Return EINVAL if buffer is too small for single entry. Preallocate buffer for cookies. Cookies will be replaced with d_off field in struct dirent at later point. Skip entries with zero inode number. Stop mangling dirent in ufs_extattr_iterate_directory(). Reviewed by: kib Sponsored by: Google Summer Of Code 2011
252437	01-Jul-2013	pfg	Change i_gen in UFS to an unsigned type. Missed format specifier. Reported by: mdf MFC after: 4 weeks
252435	01-Jul-2013	pfg	Change i_gen in UFS to an unsigned type. In UFS, i_gen is a random generated value and there is not way for it to be negative. Actually, the value of i_gen is just used to match bit patterns and it is of not consequence if the values are signed or not. Following other filesystems, set it to unsigned and use it as such, Discussed by: mckusick Reviewed by: mckusick (previous version) MFC after: 4 weeks
251171	31-May-2013	jeff	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf
250901	22-May-2013	mckusick	Properly spell sentinel (missed in 250891) No functional changes. Spotted by: Navdeep Parhar and Alexey Dokuchaev MFC after: 2 weeks
250897	22-May-2013	mckusick	Add missing buffer releases (brelse) after bread calls that return an error. One could argue that returning a buffer even when it is not valid is incorrect, but bread has always returned a buffer valid or not. Reviewed by: kib MFC after: 2 weeks
250895	22-May-2013	mckusick	Add missing 28th element to softdep types name array. Found by: Coverity Scan, CID 1007621 Reviewed by: kib MFC after: 2 weeks
250894	22-May-2013	mckusick	Null a pointer after it is freed so that when it is returned the return value is NULL. Based on the returned flags, the return value should never be inspected in the case where NULL is returned, but it is good coding practice not to return a pointer to freed memory. Found by: Coverity Scan, CID 1006096 Reviewed by: kib MFC after: 2 weeks
250892	22-May-2013	mckusick	Remove a bogus check for a NULL buffer pointer. Add a KASSERT that it is not NULL. Found by: Coverity Scan, CID 1009114 Reviewed by: kib MFC after: 2 weeks
250891	22-May-2013	mckusick	Properly spell sentinel (not sintenel or sentinal). No functional changes. Spotted by: kib MFC after: 2 weeks
250576	12-May-2013	eadler	Fix several typos PR: kern/176054 Submitted by: Christoph Mallon <christoph.mallon@gmx.de> MFC after: 3 days
249582	17-Apr-2013	gabor	- Correct mispellings of the word occurrence Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
249218	06-Apr-2013	jeff	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division
249064	03-Apr-2013	mckusick	The code in clear_remove() and clear_inodedeps() skips one entry in the pagedep and inodedep hash tables. An entry in the table is skipped because 'pagedep_hash' and 'inodedep_hash' hold the size of the hash tables - 1. The chance that this would have any operational failure is extremely unlikely. These funtions only need to find a single entry and are only called when there are too many entries. The chance that they would fail because all the entries are on the single skipped hash chain are remote. Submitted by: Pedro Martelletto Reviewed by: kib MFC after: 2 weeks
248623	22-Mar-2013	mckusick	The purpose of this change to the FFS layout policy is to reduce the running time for a full fsck. It also reduces the random access time for large files and speeds the traversal time for directory tree walks. The key idea is to reserve a small area in each cylinder group immediately following the inode blocks for the use of metadata, specifically indirect blocks and directory contents. The new policy is to preferentially place metadata in the metadata area and everything else in the blocks that follow the metadata area. The size of this area can be set when creating a filesystem using newfs(8) or changed in an existing filesystem using tunefs(8). Both utilities use the `-k held-for-metadata-blocks' option to specify the amount of space to be held for metadata blocks in each cylinder group. By default, newfs(8) sets this area to half of minfree (typically 4% of the data area). This work was inspired by a paper presented at Usenix's FAST '13: www.usenix.org/conference/fast13/ffsck-fast-file-system-checker Details of this implementation appears in the April 2013 of ;login: www.usenix.org/publications/login/april-2013-volume-38-number-2. A copy of the April 2013 ;login: paper can also be downloaded from: www.mckusick.com/publications/faster_fsck.pdf. Reviewed by: kib Tested by: Peter Holm MFC after: 4 weeks
248561	20-Mar-2013	mckusick	When renaming a directory from one parent directory to another, we need to call ufs_checkpath() to walk from our new location to the root of the filesystem to ensure that we do not encounter ourselves along the way. Until now, we accomplished this by reading the ".." entries of each directory in our path until we reached the root (or encountered an error). This change tries to avoid the I/O of reading the ".." entries by first looking them up in the name cache and only doing the I/O when the name cache lookup fails. Reviewed by: kib Tested by: Peter Holm MFC after: 4 weeks
248521	19-Mar-2013	kib	UFS support of the unmapped i/o for the user data buffers. Sponsored by: The FreeBSD Foundation Tested by: pho, scottl, jhb, bf
248515	19-Mar-2013	kib	Do not remap usermode pages into KVA for physio. Sponsored by: The FreeBSD Foundation Tested by: pho
248422	17-Mar-2013	kib	Remove negative name cache entry pointing to the target name, which could be instantiated while tdvp was unlocked. Reported by: Rick Miller <vmiller at hostileadmin com> Tested by: pho MFC after: 1 week
248283	14-Mar-2013	kib	Some style fixes. Sponsored by: The FreeBSD Foundation
248282	14-Mar-2013	kib	Add currently unused flag argument to the cluster_read(), cluster_write() and cluster_wbuild() functions. The flags to be allowed are a subset of the GB_* flags for getblk(). Sponsored by: The FreeBSD Foundation Tested by: pho
248084	09-Mar-2013	attilio	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
247388	27-Feb-2013	kib	The softdep freeblks workitem might hold a reference on the dquot. Current dqflush() panics when a dquot with with non-zero refcount is encountered. The situation is possible, because quotas are turned off before softdep workitem queue if flushed, due to the quota file writes might create softdep workitems. Make the encountering an active dquot in dqflush() not fatal, return the error from quotaoff() instead. Ignore the quotaoff() failures when ffs_flushfiles() is called in the course of softdep_flushfiles() loop, until the last iteration. At the last loop, the quotas must be closed, and because SU workitems should be already flushed, the references to dquot are gone. Sponsored by: The FreeBSD Foundation Reported and tested by: pho Reviewed by: mckusick MFC after: 2 weeks
247387	27-Feb-2013	kib	An inode block must not be blockingly read while cg block is owned. The order is inode buffer lock -> snaplk -> cg buffer lock, reversing the order causes deadlocks. Inode block must not be written while cg block buffer is owned. The FFS copy on write needs to allocate a block to copy the content of the inode block, and the cylinder group selected for the allocation might be the same as the owned cg block. The reserved block detection code in the ffs_copyonwrite() and ffs_bp_snapblk() is unable to detect the situation, because the locked cg buffer is not exposed to it. In order to maintain the dependency between initialized inode block and the cg_initediblk pointer, look up the inode buffer in non-blocking mode. If succeeded, brelse cg block, initialize the inode block and write it. After the write is finished, reread cg block and update the cg_initediblk. If inode block is already locked by another thread, let the another thread initialize it. If another thread raced with us after we started writing inode block, the situation is detected by an update of cg_initediblk. Note that double-initialization of the inode block is harmless, the block cannot be used until cg_initediblk is incremented. Sponsored by: The FreeBSD Foundation In collaboration with: pho Reviewed by: mckusick MFC after: 1 month X-MFC-note: after r246877
246877	16-Feb-2013	mckusick	The UFS2 filesystem allocates new blocks of inodes as they are needed. When a cylinder group runs short of inodes, a new block for inodes is allocated, zero'ed, and written to the disk. The zero'ed inodes must be on the disk before the cylinder group can be updated to claim them. If the cylinder group claiming the new inodes were written before the zero'ed block of inodes, the system could crash with the filesystem in an unrecoverable state. Rather than adding a soft updates dependency to ensure that the new inode block is written before it is claimed by the cylinder group map, we just do a barrier write of the zero'ed inode block to ensure that it will get written before the updated cylinder group map can be written. This change should only slow down bulk loading of newly created filesystems since that is the primary time that new inode blocks need to be created. Reported by: Robert Watson Reviewed by: kib Tested by: Peter Holm
246612	10-Feb-2013	kib	Fix several unsafe pointer dereferences in the buffered_write() function, implementing the sysctl vfs.ffs.set_bufoutput (not used in the tree yet). - The current directory vnode dereference is unsafe since fd_cdir could be changed and unreferenced, lock the filedesc around and vref the fd_cdir. - The VTOI() conversion of the fd_cdir is unsafe without first checking that the vnode is indeed from an FFS mount, otherwise the code dereferences a random memory. - The cdir could be reclaimed from under us, lock it around the checks. - The type of the fp vnode might be not a disk, or it might have changed while the thread was in flight, check the type. Reviewed and tested by: mckusick MFC after: 2 weeks
246562	08-Feb-2013	pfg	Remove unused MAXSYMLINKLEN macro. Reviewed by: mckusick PR: kern/175794 MFC after: 1 week
246299	03-Feb-2013	pfg	UFS: Remove dead assignment. Submitted by: Christoph Mallon MFC after: 3 days
246289	03-Feb-2013	mckusick	For UFS2 i_blocks is unsigned. The current "sanity" check that it has gone below zero after the blocks in its inode are freed is a no-op which the compiler fails to warn about because of the use of the DIP macro. Change the sanity check to compare the number of blocks being freed against the value i_blocks. If the number of blocks being freed exceeds i_blocks, just set i_blocks to zero. Reported by: Pedro Giffuni (pfg@) MFC after: 2 weeks
245286	11-Jan-2013	kib	Add flags argument to vfs_write_resume() and remove vfs_write_resume_flags(). Sponsored by: The FreeBSD Foundation
244925	01-Jan-2013	kib	The process_deferred_inactive() function locks the vnodes of the ufs mount, which means that is must not be called while the snaplock is owned. The vfs_write_resume(9) does call the function as the VFS_SUSP_CLEAN() method, which is too early and falls into the region still protected by snaplock. Add yet another flag for the vfs_write_resume_flags() to avoid calling suspension cleanup handler after the suspend is lifted, and use it in the ffs_snapshot() call to vfs_write_resume. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
244795	28-Dec-2012	kib	Make it possible to atomically resume writes on the mount and account the write start, by adding a variation of the vfs_write_resume(9) which accepts flags. Use the new function to prevent a deadlock between parallel suspension and snapshotting a UFS mount. The ffs_snapshot() code performed vfs_write_resume() followed by vn_start_write() while owning the snaplock. If the suspension intervene between resume and vn_start_write(), the deadlock occured after the suspending thread tried to lock the snaplock, most typically during the write in the ffs_copyonwrite(). Reported and tested by: Andreas Longwitz <longwitz@incore.de> Reviewed by: mckusick MFC after: 2 weeks X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC, in HEAD
244534	21-Dec-2012	attilio	Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There are no evidences that consumers could bear with such change in semantic and this situation could finally lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved. Tested by: pho Reviewed by: kib, mdf MFC after: 1 week
244239	15-Dec-2012	kib	Fix a typo, resulting in the NULL pointer dereference. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days
243311	19-Nov-2012	attilio	r16312 is not any longer real since many years (likely since when VFS received granular locking) but the comment present in UFS has been copied all over other filesystems code incorrectly for several times. Removes comments that makes no sense now. Reviewed by: kib MFC after: 3 days
243250	18-Nov-2012	trasz	Fix build of kdump(1).
243245	18-Nov-2012	trasz	Add UFS writesuspension mechanism, designed to allow userland processes to modify on-disk metadata for filesystems mounted for write. Reviewed by: kib, mckusick Sponsored by: FreeBSD Foundation
243018	14-Nov-2012	jeff	- Fix a truncation bug with softdep journaling that could leak blocks on crash. When truncating a file that never made it to disk we use the canceled allocation dependencies to hold the journal records until the truncation completes. Previously allocdirect dependencies on the id_bufwait list were not considered and their journal space could expire before the bitmaps were written. Cancel them and attach them to the freeblks as we do for other allocdirects. - Add KTR traces that were used to debug this problem. - When adding jsegdeps, always use jwork_insert() so we don't have more than one segdep on a given jwork list. Sponsored by: EMC / Isilon Storage Division
242924	12-Nov-2012	jeff	- Fix a bug that has existed since the original softdep implementation. When a background copy of a cg is written we complete any work associated with that bmsafemap. If new work has been added to the non-background copy of the buffer it will be completed before the next write happens. The solution is to do the rollbacks when we make the copy so only those dependencies that were present at the time of writing will be completed when the background write completes. This would've resulted in various bitmap related corruptions and panics. It also would've expired journal entries early causing journal replay to miss some records. MFC after: 2 weeks
242833	09-Nov-2012	attilio	Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.
242815	09-Nov-2012	jeff	- Correct rev 242734, segments can sometimes get stuck. Be a bit more defensive with segment state. Reported by: b. f. <bf1783@googlemail.com>
242734	08-Nov-2012	jeff	- Implement BIO_FLUSH support around journal entries. This will not 100% solve power loss problems with dishonest write caches. However, it should improve the situation and force a full fsck when it is unable to resolve with the journal. - Resolve a case where the journal could wrap in an unsafe way causing us to prematurely lose journal entries in very specific scenarios. Discussed with: mckusick MFC after: 1 month
242520	03-Nov-2012	mckusick	When a file is first being written, the dynamic block reallocation (implemented by ffs_reallocblks_ufs[12]) relocates the file's blocks so as to cluster them together into a contiguous set of blocks on the disk. When the cluster crosses the boundary into the first indirect block, the first indirect block is initially allocated in a position immediately following the last direct block. Block reallocation would usually destroy locality by moving the indirect block out of the way to keep the data blocks contiguous. This change compensates for this problem by noting that the first indirect block should be left immediately following the last direct block. It then tries to start a new cluster of contiguous blocks (referenced by the indirect block) immediately following the indirect block. We should also do this for other indirect block boundaries, but it is only important for the first one. Suggested by: Bruce Evans MFC: 2 weeks
242492	02-Nov-2012	jeff	- In cancel_mkdir_dotdot don't panic if the inodedep is not available. If the previous diradd had already finished it could have been reclaimed already. This would only happen under heavy dependency pressure. Reported by: Andrey Zonov <zont@FreeBSD.org> Discussed with: mckusick MFC after: 1 week
242476	02-Nov-2012	kib	The r241025 fixed the case when a binary, executed from nullfs mount, was still possible to open for write from the lower filesystem. There is a symmetric situation where the binary could already has file descriptors opened for write, but it can be executed from the nullfs overlay. Handle the issue by passing one v_writecount reference to the lower vnode if nullfs vnode has non-zero v_writecount. Note that only one write reference can be donated, since nullfs only keeps one use reference on the lower vnode. Always use the lower vnode v_writecount for the checks. Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT to manipulate the v_writecount value, which manages a single bypass reference to the lower vnode. Caling the VOPs instead of directly accessing v_writecount provide the fix described in the previous paragraph. Tested by: pho MFC after: 3 weeks
242379	30-Oct-2012	trasz	Fix problem with geom_label(4) not recognizing UFS labels on filesystems extended using growfs(8). The problem here is that geom_label checks if the filesystem size recorded in UFS superblock is equal to the provider (i.e. device) size. This check cannot be removed due to backward compatibility. On the other hand, in most cases growfs(8) cannot set fs_size in the superblock to match the provider size, because, differently from newfs(8), it cannot recompute cylinder group sizes. To fix this problem, add another superblock field, fs_providersize, used only for this purpose. The geom_label(4) will attach if either fs_size (filesystem created with newfs(8)) or fs_providersize (filesystem expanded using growfs(8)) matches the device size. PR: kern/165962 Reviewed by: mckusick Sponsored by: FreeBSD Foundation
242259	28-Oct-2012	trasz	Fix two problems that caused instant panic when the device mounted with softupdates went away. Note that this does not fix the problem entirely; I'm committing it now to make it easier for someone to pick up the work. Reviewed by: mckusick
241896	22-Oct-2012	kib	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
241011	27-Sep-2012	mdf	Fix up kernel sources to be ready for a 64-bit ino_t. Original code by: Gleb Kurtsou
239359	17-Aug-2012	mjg	Remove unused member of struct indir (in_exists) from UFS and EXT2 code. Reviewed by: mckusick Approved by: trasz (mentor) MFC after: 1 week
239065	05-Aug-2012	kib	After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull vm_param.h was removed. Other big dependency of vm_page.h on vm_param.h are PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages. Stop including vm_param.h into vm_page.h. Include vm_param.h explicitely for the kernel code which needs it. Suggested and reviewed by: alc MFC after: 2 weeks
238697	22-Jul-2012	kevlo	Use NULL instead of 0 for pointers
238029	02-Jul-2012	kib	Extend the KPI to lock and unlock f_offset member of struct file. It now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries(). Ensure that on 32bit architectures f_offset, which is 64bit quantity, always read and written under the mtxpool protection. This fixes apparently easy to trigger race when parallel lseek()s or lseek() and read/write could destroy file offset. The already broken ABI emulations, including iBCS and SysV, are not converted (yet). Tested by: pho No objections from: jhb MFC after: 3 weeks
237366	21-Jun-2012	kib	Fix unbounded-length malloc, controlled from usermode. The added check is performed before exact size of the buffer is calculated, but the buffer cannot have size greater then the total space allocated for extended attributes. The existing check is executing with precise size, but it is too late, since buffer needs to be allocated in advance. Also, adapt to uio_resid being of ssize_t type. Use lblktosize instead of multiplying by fs block size by hand as well. Reported and tested by: pho MFC after: 1 week
236937	11-Jun-2012	mckusick	In softdep_setup_inomapdep() we may have to allocate both inodedep and bmsafemap dependency structures in inodedep_lookup() and bmsafemap_lookup() respectively. The setup of these structures must be done while holding the soft-dependency mutex. If the inodedep is allocated first, it may be freed in the I/O completion callback when the mutex is released to allocate the bmsafemap. If the bmsafemap is allocated first, it may be freed in the I/O completion callback when the mutex is released to allocate the inodedep. To resolve this problem, bmsafemap_lookup has had a parameter added that allows a pre-malloc'ed bmsafemap to be passed in so that it does not need to release the mutex to create a new bmsafemap. The softdep_setup_inomapdep() routine pre-malloc's a bmsafemap dependency before acquiring the mutex and starting to build the inodedep with a call to inodedep_lookup(). The subsequent call to bmsafemap_lookup() is passed this pre-allocated bmsafemap entry so that it need not release the mutex if it needs to create a new one. Reported by: Peter Holm Tested by: Peter Holm MFC after: 1 week
236322	30-May-2012	kib	Enable vn_io_fault() lock avoidance for UFS. Tested by: pho MFC after: 2 months
236044	26-May-2012	kib	Implement SEEK_HOLE/SEEK_DATA for UFS. MFC after: 2 weeks
235610	18-May-2012	mckusick	Add missing `continue' statement at end of case. Found by: Kevin Lo (kevlo@) MFC after: 1 week
234613	23-Apr-2012	trasz	Remove unused thread argument from ufs_extattr_uepm_lock()/ufs_extattr_uepm_unlock().
234612	23-Apr-2012	trasz	Fix build.
234608	23-Apr-2012	trasz	Remove unused thread argument from clear_inodeps() and clear_remove().
234607	23-Apr-2012	trasz	Remove unused thread argument to vrecycle(). Reviewed by: kib
234605	23-Apr-2012	trasz	Remove unused thread argument from vtruncbuf(). Reviewed by: kib
234537	21-Apr-2012	trasz	Fix use-after-free introduced in r234036. Reviewed by: mckusick Tested by: pho
234483	20-Apr-2012	mckusick	This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point to replace MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync routines. The vfs_msync routine is run every 30 seconds for every writably mounted filesystem. It ensures that any files mmap'ed from the filesystem with modified pages have those pages queued to be written back to the file from which they are mapped. The ffs_lazy_sync and qsync routines are run every 30 seconds for every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine ensures that any files that have been accessed in the previous 30 seconds have had their access times queued for updating in the filesystem. The qsync routine ensures that any files with modified quotas have those quotas queued to be written back to their associated quota file. In a system configured with 250,000 vnodes, less than 1000 are typically active at any point in time. Prior to this change all 250,000 vnodes would be locked and inspected twice every minute by the syncer. For UFS/FFS filesystems they would be locked and inspected six times every minute (twice by each of these three routines since each of these routines does its own pass over the vnodes associated with a mount point). With this change the syncer now locks and inspects only the tiny set of vnodes that are active. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
234421	18-Apr-2012	jh	The part about exec atime no longer applies in the comment. Pointed out by: bde
234386	17-Apr-2012	mckusick	Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL. The primary changes are that the user of the interface no longer needs to manage the mount-mutex locking and that the vnode that is returned has its mutex locked (thus avoiding the need to check to see if its is DOOMED or other possible end of life senarios). To minimize compatibility issues for third-party developers, the old MNT_VNODE_FOREACH interface will remain available so that this change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH will be removed in head. The reason for this update is to prepare for the addition of the MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
234158	11-Apr-2012	mckusick	Export vinactive() from kern/vfs_subr.c (e.g., make it no longer static and declare its prototype in sys/vnode.h) so that it can be called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c) instead of the body of vinactive() being cut and pasted into process_deferred_inactive(). Reviewed by: kib MFC after: 2 weeks
234103	10-Apr-2012	jh	- Return EPERM from ufs_setattr() when an user without PRIV_VFS_SYSFLAGS privilege attempts to toggle SF_SETTABLE flags. - Use the '^' operator in the SF_SNAPSHOT anti-toggling check. Flags are now stored to ip->i_flags in one place after all checks. Submitted by: bde
234036	08-Apr-2012	trasz	Fix panic in ffs_reload(), which may happen when read-only filesystem gets resized and then reloaded. Reviewed by: kib, mckusick (earlier version) Sponsored by: The FreeBSD Foundation
234024	08-Apr-2012	mckusick	Drop an unnecessary setting of si_mountpt when updating a UFS mount point. Clearly it must have been set when the mount was done. Reviewed by: kib
233875	04-Apr-2012	jh	Add a check for unsupported file flags to ufs_setattr(). Discussed with: bde MFC after: 2 weeks
233817	02-Apr-2012	mckusick	A file cannot be deallocated until its last name has been removed and it is no longer referenced by a user process. The inode for a file whose name has been removed, but is still referenced at the time of a crash will still be allocated in the filesystem, but will have no references (e.g., they will have no names referencing them from any directory). With traditional soft updates these unreferenced inodes will be found and reclaimed when the background fsck is run. When using journaled soft updates, the kernel must keep track of these inodes so that it can find and reclaim them during the cleanup process. Their existence cannot be stored in the journal as the journal only handles short-term events, and they may persist for days. So, they are tracked by keeping them in a linked list whose head pointer is stored in the superblock. The journal tracks them only until their linked list pointers have been commited to disk. Part of the cleanup process involves traversing the list of unreferenced inodes and reclaiming them. This bug was triggered when confusion arose in the commit steps of keeping the unreferenced-inode linked list coherent on disk. Notably, a race between the link() system call adding a link-count to a file and the unlink() system call removing a link-count to the file. Here if the unlink() ran after link() had looked up the file but before link() had incremented the link-count of the file, the file's link-count would drop to zero before the link() incremented it back up to one. If the file was referenced by a user process, the first transition through zero made it appear that it should be added to the unreferenced-inode list when in fact it should not have been added. If the new name created by link() was deleted within a few seconds (with the file still referenced by a user process) it would legitimately be a candidate for addition to the unreferenced-inode list. The result was that there were two attempts to add the same inode to the unreferenced-inode list which scrambled the unreferenced-inode list's pointers leading to a panic. The fix is to detect and avoid the false attempt at adding it to the unreferenced-inode list by having the link() system call check to see if the link count is zero before it increments it. If it is, the link() fails with ENOENT (showing that it has failed the link()/unlink() race). While tracking down this bug, we have added additional assertions to detect the problem sooner and also simplified some of the code. Reported by: Kirk Russell Fix submitted by: Jeff Roberson Tested by: Peter Holm PR: kern/159971 MFC (to 9 only): 2 weeks
233787	02-Apr-2012	jh	- Use more natural ip->i_flags instead of vap->va_flags in the final flags check. - Add a comment for the immutable/append check done after handling of the flags. - Style improvements. No functional change intended. Submitted by: bde MFC after: 2 weeks
233629	28-Mar-2012	mckusick	A refinement of change 232351 to avoid a race with a forcible unmount. While we have a snapshot vnode unlocked to avoid a deadlock with another inode in the same inode block being updated, the filesystem containing it may be forcibly unmounted. When that happens the snapshot vnode is revoked. We need to check for that condition and fail appropriately. This change will be included along with 232351 when it is MFC'ed to 9. Spotted by: kib Reviewed by: kib
233627	28-Mar-2012	mckusick	Keep track of the mount point associated with a special device to enable the collection of counts of synchronous and asynchronous reads and writes for its associated filesystem. The counts are displayed using `mount -v'. Ensure that buffers used for paging indicate the vnode from which they are operating so that counts of paging I/O operations from the filesystem are collected. This checkin only adds the setting of the mount point for the UFS/FFS filesystem, but it would be trivial to add the setting and clearing of the mount point at filesystem mount/unmount time for other filesystems too. Reviewed by: kib
233610	28-Mar-2012	kib	Do trivial reformatting of the comment to record the missed commit message for r233609: Restore the writes of atimes, quotas and superblock from syncer vnode. Noted by: rdivacky
233609	28-Mar-2012	kib	Reviewed by: bde, mckusick Tested by: pho MFC after: 2 weeks
233608	28-Mar-2012	kib	Microoptimize: in qsync loop over mount vnodes, only unlock mount interlock after we committed to try to vget() the vnode. Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 1 week
233607	28-Mar-2012	kib	Update comment. MFC after: 3 days
233438	25-Mar-2012	mckusick	Add a third flags argument to ffs_syncvnode to avoid a possible conflict with MNT_WAIT flags that passed in its second argument. This will be MFC'ed together with r232351. Discussed with: kib
232948	13-Mar-2012	kib	Supply boolean as the second argument to ffs_update(), and not a MNT_[NO]WAIT constants, which in fact always caused sync operation. Based on the submission by: bde Reviewed by: mckusick MFC after: 2 weeks
232837	11-Mar-2012	kib	Remove superfluous brackets. Submitted by: alc MFC after: 2 weeks
232836	11-Mar-2012	kib	Do schedule delayed writes for async mounts. While there, make some style adjustments, like missed () around return values. Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks
232835	11-Mar-2012	kib	Do not fall back to slow synchronous i/o when low on memory or buffers. The bawrite() schedules the write to happen immediately, and its use frees the current thread to do more cleanups. Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks
232834	11-Mar-2012	kib	In ffs_syncvnode(), pass boolean false as second argument of ffs_update(). Synchronous inode block update is not needed for MNT_LAZY callers (syncer), and since waitfor values are not zero, code did unneccessary synchronous update. Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks
232833	11-Mar-2012	kib	Remove not needed ARGSUSED lint command. Submitted by: bde MFC after: 3 days
232821	11-Mar-2012	kib	Remove fifo.h. The only used function declaration from the header is migrated to sys/vnode.h. Submitted by: gianni
232732	09-Mar-2012	pho	Revert r232692 as the correct place to fix this is at the syscall level.
232709	09-Mar-2012	kib	Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which allows a filesystem to request VFS to not allow MNTK_ASYNC. MFC after: 1 week
232701	08-Mar-2012	jhb	Add KTR_VFS traces to track modifications to a vnode's writecount.
232692	08-Mar-2012	pho	syscall() fuzzing can trigger this panic. Return EINVAL instead. MFC after: 1 week
232401	02-Mar-2012	jhb	Similar to the fixes in 226967 and 226987, purge any name cache entries associated with the previous vnode (if any) associated with the target of a rename(). Otherwise, a lookup of the target pathname concurrent with a rename() could re-add a name cache entry after the namei(RENAME) lookup in kern_renameat() had purged the target pathname. MFC after: 2 weeks
232351	01-Mar-2012	mckusick	This change avoids a kernel deadlock on "snaplk" when using snapshots on UFS filesystems running with journaled soft updates. This is the first of several bugs that need to be fixed before removing the restriction added in -r230250 to prevent the use of snapshots on filesystems running with journaled soft updates. The deadlock occurs when holding the snapshot lock (snaplk) and then trying to flush an inode via ffs_update(). We become blocked by another process trying to flush a different inode contained in the same inode block that we need. It holds the inode block for which we are waiting locked. When it tries to write the inode block, it gets blocked waiting for the our snaplk when it calls ffs_copyonwrite() to see if the inode block needs to be copied in our snapshot. The most obvious place that this deadlock arises is in the ffs_copyonwrite() routine when it updates critical metadata in a snapshot and tries to write it out before proceeding. The fix here is to write the data and indirect block pointer for the snapshot, but to skip the call to ffs_update() to write the snapshot inode. To ensure that we will never have to update a pointer in the inode itself, the ffs_snapshot() routine that creates the snapshot has to ensure that all the direct blocks are allocated as part of the creation of the snapshot. A less obvious place that this deadlock occurs is when we hold the snaplk because we are deleting a snapshot. In the course of doing the deletion, we need to allocate various soft update dependency structures and allocate some journal space. If we hit a resource limit while doing this we decrease the resources in use by flushing out an existing dirty file to get it to give up the soft dependency resources that it holds. The flush can cause an ffs_update() to be done on the inode for the file that we have selected to flush resulting in the same deadlock as described above when the inode that we have chosen to flush resides in the same inode block as the snapshot inode that we hold. The fix is to defer cleaning up any time that the inode on which we are operating is a snapshot. Help and review by: Jeff Roberson Tested by: Peter Holm MFC (to 9 only) after: 2 weeks
232003	22-Feb-2012	kib	Properly lock DQREF() with dqhlock. Missed locking caused counter corruption. Assert that the dq reference value is sane before decrementing it. Reported and tested by: pho MFC after: 1 week
231949	21-Feb-2012	kib	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month
231572	13-Feb-2012	mckusick	Missing conditions in checking whether an inode has been written. Found and tested by: Peter Holm MFC after: 2 weeks (to 9 only)
231313	09-Feb-2012	mckusick	Historically when an application wrote an entire block of a file, the kernel allocated a buffer but did not zero it as it was about to be completely filled by a uiomove() from the user's buffer. However, if the uiomove() failed, the old contents of the buffer could be exposed especially if the file was being mmap'ed. The fix was to always zero the buffer when it was allocated. This change first attempts the uiomove() to the newly allocated (and dirty) buffer and only zeros it if the uiomove() fails. The effect is to eliminate the gratuitous zeroing of the buffer in the usual case where the uiomove() successfully fills it. Reviewed by: kib Tested by: scottl MFC after: 2 weeks (to 9 only)
231160	07-Feb-2012	mckusick	In the original days of BSD, a sync was issued on every filesystem every 30 seconds. This spike in I/O caused the system to pause every 30 seconds which was quite annoying. So, the way that sync worked was changed so that when a vnode was first dirtied, it was put on a 30-second cleaning queue (see the syncer_workitem_pending queues in kern/vfs_subr.c). If the file has not been written or deleted after 30 seconds, the syncer pushes it out. As the syncer runs once per second, dirty files are trickled out slowly over the 30-second period instead of all at once by a call to sync(2). The one drawback to this is that it does not cover the filesystem metadata. To handle the metadata, vfs_allocate_syncvnode() is called to create a "filesystem syncer vnode" at mount time which cycles around the cleaning queue being sync'ed every 30 seconds. In the original design, the only things it would sync for UFS were the filesystem metadata: inode blocks, cylinder group bitmaps, and the superblock (e.g., by VOP_FSYNC'ing devvp, the device vnode from which the filesystem is mounted). Somewhere in its path to integration with FreeBSD the flushing of the filesystem syncer vnode got changed to sync every vnode associated with the filesystem. The result of this change is to return to the old filesystem-wide flush every 30-seconds behavior and makes the whole 30-second delay per vnode useless. This change goes back to the originally intended trickle out sync behavior. Key to ensuring that all the intended semantics are preserved (e.g., that all inode updates get flushed within a bounded period of time) is that all inode modifications get pushed to their corresponding inode blocks so that the metadata flush by the filesystem syncer vnode gets them to the disk in a timely way. Thanks to Konstantin Belousov (kib@) for doing the audit and commit -r231122 which ensures that all of these updates are being made. Reviewed by: kib Tested by: scottl MFC after: 2 weeks
231122	07-Feb-2012	kib	Sprinkle missed calls to asynchronous UFS_UPDATE() in attempt to guarantee that all UFS inode metadata changes results in the dirtiness of the inodeblock. Due to missed inodeblock updates, syncer was required to fsync each mount point' vnode to guarantee periodic metadata flush. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks
231091	06-Feb-2012	kib	Add missing opt_quota.h include to activate #ifdef QUOTA blocks, apparently a step in unbreaking QUOTA support. Reported and tested by: Adam Strohl <adams-freebsd ateamsystems com> MFC after: 1 week
231077	06-Feb-2012	kib	JNEWBLK dependency may legitimately appear on the buf dependency list. If softdep_sync_buf() discovers such dependency, it should do nothing, which is safe as it is only waiting on the parent buffer to be written, so it can be removed. Committed on behalf of: jeff MFC after: 1 week
231075	06-Feb-2012	kib	Current implementations of sync(2) and syncer vnode fsync() VOP uses mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which is needed to guarantee a synchronous completion of the initiated i/o before syscall or VOP return. Global removal of MNTK_ASYNC option is harmful because not only i/o started from corresponding thread becomes synchronous, but all i/o is synchronous on the filesystem which is initiated during sync(2) or syncer activity. Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local thread flag to disable async i/o for current thread only. Use the opportunity to move DOINGASYNC() macro into sys/vnode.h and consistently use it through places which tested for MNTK_ASYNC. Some testing demonstrated 60-70% improvements in run time for the metadata-intensive operations on async-mounted UFS volumes, but still with great deviation due to other reasons. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks
230250	17-Jan-2012	mckusick	There are several bugs/hangs when trying to take a snapshot on a UFS/FFS filesystem running with journaled soft updates. Until these problems have been tracked down, return ENOTSUPP when an attempt is made to take a snapshot on a filesystem running with journaled soft updates. MFC after: 2 weeks
230249	17-Jan-2012	mckusick	Make sure all intermediate variables holding mount flags (mnt_flag) and that all internal kernel calls passing mount flags are declared as uint64_t so that flags in the top 32-bits are not lost. MFC after: 2 weeks
230221	16-Jan-2012	ivoras	Add a bit of verbosity to the comment.
230101	14-Jan-2012	mckusick	Convert FFS mount error messages from kernel printf's to using the vfs_mount_error error message facility provided by the nmount interface. Clean up formatting of mount warnings which still need to use kernel printf's since they do not return errors. Requested by: Craig Rodrigues <rodrigc@crodrigues.org> MFC after: 2 weeks
229828	08-Jan-2012	kib	Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon(). The vfs_busy() is after covered vnode lock in the global lock order, but since quotaon() does recursive VFS call to open quota file, we usually end up locking covered vnode after mp is busied in sys_quotactl(). Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon, esp. if vfs_busy cannot succeed due to unmount being performed. Reported and tested by: pho MFC after: 1 week
229200	01-Jan-2012	ed	Migrate ufs and ext2fs from skpc() to memcchr(). While there, remove a useless check from the code. memcchr() always returns characters unequal to 0xff in this case, so inosused[i] ^ 0xff can never be equal to zero. Also, the fact that memcchr() returns a pointer instead of the number of bytes until the end, makes conversion to an offset far more easy.
227382	09-Nov-2011	gleb	Use implementation independent inoNN_t scalars for on-disk UFS structures Approved by: mdf (mentor)
227309	07-Nov-2011	ed	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
227267	06-Nov-2011	ed	Remove MALLOC_DECLAREs of nonexisting malloc-pools. After careful grepping, it seems none of these pools can be found in our source tree. They are not in use, nor are they defined.
226971	31-Oct-2011	pho	Fix the wrong commit log message for r226967: "Added missing cache purge of from argument" and fix the comment.
226967	31-Oct-2011	pho	The kern_renameat() looks up the fvp using the DELETE flag, which causes the removal of the name cache entry for fvp. Reported by: Anton Yuzhaninov <citrin citrin ru> In collaboration with: kib MFC after: 1 week
225807	27-Sep-2011	mckusick	This update eliminates a lock-order reversal warning discovered whle tracking down the system hang reported in kern/160662 and corrected in revision 225806. The LOR is not the cause of the system hang and indeed cannot cause an actual deadlock. However, it can be easily eliminated by defering the acquisition of a buflock until after all the vnode locks have been acquired. Reported by: Hans Ottevanger PR: kern/160662
225806	27-Sep-2011	mckusick	This update eliminates the system hang reported in kern/160662 when taking a snapshot on a filesystem running with journaled soft updates. Reported by: Hans Ottevanger Fix verified by: Hans Ottevanger PR: kern/160662
225700	20-Sep-2011	kib	Use nowait sync request for a vnode when doing softdep cleanup. We possibly own the unrelated vnode lock, doing waiting sync causes deadlocks. Reported and tested by: pho Approved by: re (bz)
225166	25-Aug-2011	mm	Generalize ffs_pages_remove() into vn_pages_remove(). Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F. PR: kern/160035, kern/156933 Reviewed by: kib, pjd Approved by: re (kib) MFC after: 1 week
225104	23-Aug-2011	ae	Fix lock leak. Reported by: Alex Lyashkov Approved by: re (kib) MFC after: 1 week
224876	15-Aug-2011	rwatson	Fix two cases involving opt_capsicum.h and module builds: (1) opt_capsicum.h is no longer required in ffs_alloc.c, so remove the #include. (2) portalfs depends on opt_capsicum.h, so have the Makefile generate one if required. These affect only modules built without a kernel (i.e, not buildkernel, but yes buildworld if the dubious MODULES_WITH_WORLD is used). Approved by: re (bz) Sponsored by: Google Inc
224778	11-Aug-2011	rwatson	Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0: Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op. Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions. In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit. Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent. Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc
224503	30-Jul-2011	mckusick	Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP is set so that mount can revert back to using MNT_NOWAIT when doing getmntinfo. Approved by: re (kib)
224294	24-Jul-2011	mckusick	Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag so that it is visible to userland programs. This change enables the `mount' command with no arguments to be able to show if a filesystem is mounted using journaled soft updates as opposed to just normal soft updates. Approved by: re (bz)
224272	22-Jul-2011	mckusick	Default debugging error messages to off for journaled soft updates sysctls. Delete limiting on output of these sysctls. Approved by: re (kib)
224061	15-Jul-2011	mckusick	Add an FFS specific mount option to allow a filesystem checker (typically fsck_ffs) to register that it wishes to use FFS specific sysctl's to update the filesystem. This ensures that two checkers cannot run on a given filesystem at the same time and that no other process accidentally or maliciously uses the filesystem updating sysctls inappropriately. This functionality is needed by the journaling soft-updates recovery code.
224027	14-Jul-2011	mckusick	Consistently check mount flag (MNTK_SUJ) rather than superblock flag (FS_SUJ) when determining whether to do journaling-based operations. The mount flag is set only when journaling is active while the superblock flag is set to indicate that journaling is to be used. For example, when the filesystem is mounted read-only, the journaling may be present (FS_SUJ) but not active (MNTK_SUJ). Inappropriate checking of the FS_SUJ flag was causing some journaling actions to be attempted at inappropriate times.
223902	10-Jul-2011	mckusick	When first creating snapshots, we may free some blocks within it. These blocks should not have TRIM applied to them. Submitted by: Kostik Belousov
223900	10-Jul-2011	mckusick	Allow disk partitions associated with UFS read-only mounted filesystems to be opened for writing. This functionality used to be special-cased for just the root filesystem, but with this change is now available for all UFS filesystems. This change is needed for journaled soft updates recovery. Discussed with: Jeff Roberson
223888	09-Jul-2011	kib	Use 'curthread_pflags' instead of 'thread_pflags' to signify that only curthread can be operated upon. Requested by: attilio MFC after: 1 week
223887	09-Jul-2011	kib	Use helper functions instead of manually managing TDP_INBDFLUSH. Sponsored by: The FreeBSD Foundation Reviewed by: alc (previous version) MFC after: 1 week
223772	04-Jul-2011	jeff	- Speed up pendingblock processing again. Having too much delay between ffs_blkfree() and the pending adjustment causes all kinds of space related problems.
223771	04-Jul-2011	jeff	- Handle D_JSEGDEP in the softdep_sync_buf() switch. These can now find themselves on snapshot vnodes. Reported by: pho
223770	04-Jul-2011	jeff	- It is impossible to run request_cleanup() while doing a copyonwrite. This will most likely cause new block allocations which can recurse into request cleanup. - While here optimize the ufs locking slightly. We need only acquire and drop once. - process_removes() and process_truncates() also is only needed once. - Attempt to flush each item on the worklist once but do not loop forever if some can not be completed. Discussed with: mckusick
223769	04-Jul-2011	jeff	- Fix an inode quota leak. We need to decrement the quota once and only once. Tested by: pho Reviewed by: mckusick
223687	29-Jun-2011	mckusick	Handle the FREEDEP case in softdep_sync_buf(). This fix failed to get added in -r223325. Submitted by: Peter Holm
223677	29-Jun-2011	alc	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib
223325	20-Jun-2011	jeff	- Fix directory count rollbacks by passing the mode to the journal dep earlier. - Add rollback/forward code for frag and cluster accounting. - Handle the FREEDEP case in softdep_sync_buf(). (submitted by pho)
223268	18-Jun-2011	mckusick	Fixed dereference of a NULL pointer. Reported by: Peter Holm
223169	16-Jun-2011	mckusick	Drop the include of <ufs/ffs/ffs_extern.h> from usr.sbin/makefs/ffs/ffs_bswap.c and usr.sbin/makefs/ffs/ffs_subr.c as they have no need of anything in that file. No other programs or libraries include <ufs/ffs/ffs_extern.h> (nor should they as it is totally in-kernel interfaces). For added protection I enclosed the entire contents of <ufs/ffs/ffs_extern.h> in ifdef _KERNEL. Feedback from: Bruce Evans and Tai-hwa Liang
223138	16-Jun-2011	avatar	Fixing compilation bustage by introducing another forward declaration.
223127	15-Jun-2011	mckusick	Ensure that filesystem metadata contained within persistent snapshots is always kept consistent. Suggested by: Jeff Roberson
223114	15-Jun-2011	mckusick	With the restructuring of the block reclaimation code, the notification messages for a filesystem being out of space need to be moved so that they do not print out until after a failed cleanup attempt. Suggested by: Jeff Roberson
223105	15-Jun-2011	mckusick	Missing cleanup case after completion of a snapshot vnode write claiming a released block. Submitted by: Jeff Roberson Tested by: Peter Holm
223052	13-Jun-2011	dim	Use alternative, less messy solution to avoid breakage after r223020: put the snapdata structure between #ifdef _KERNEL guards. Suggested by: kib
223020	12-Jun-2011	mckusick	Update to soft updates journaling to properly track freed blocks that get claimed by snapshots. Submitted by: Jeff Roberson Tested by: Peter Holm
223018	12-Jun-2011	mckusick	Disable the soft updates journaling after a filesystem is successfully downgraded to read-only. It will be restarted if the filesystem is upgraded back to read-write.
222958	10-Jun-2011	jeff	Implement fully asynchronous partial truncation with softupdates journaling to resolve errors which can cause corruption on recovery with the old synchronous mechanism. - Append partial truncation freework structures to indirdeps while truncation is proceeding. These prevent new block pointers from becoming valid until truncation completes and serialize truncations. - On completion of a partial truncate journal work waits for zeroed pointers to hit indirects. - softdep_journal_freeblocks() handles last frag allocation and last block zeroing. - vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it is only implemented in one place. - Block allocation failure handling moved up one level so it does not proceed with buf locks held. This permits us to do more extensive reclaims when filesystem space is exhausted. - softdep_sync_metadata() is broken into two parts, the first executes once at the start of ffs_syncvnode() and flushes truncations and inode dependencies. The second is called on each locked buf. This eliminates excessive looping and rollbacks. - Improve the mechanism in process_worklist_item() that handles acquiring vnode locks for handle_workitem_remove() so that it works more generally and does not loop excessively over the same worklist items on each call. - Don't corrupt directories by zeroing the tail in fsck. This is only done for regular files. - Push a fsync complete record for files that need it so the checker knows a truncation in the journal is no longer valid. Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts) Tested by: pho
222955	10-Jun-2011	jeff	- Add support for referencing quota structures without needing the inode pointer for softupdates. Submitted by: mckusick
222954	10-Jun-2011	jeff	- If the fsync in ufs_direnter fails SUJ can later panic because we have partially added a name. Allow ufs_direnter() to continue in the hopes that it is a transient error. If it is not, the directory is corrupted already from IO errors and writing this new block is not likely to make things worse.
222724	05-Jun-2011	mckusick	Grammer fix in comment. Eliminate one (of several) possible conflicting buffer locks when trying to reclaim blocks. Rest of fix to be incorporated as part of SUJ update by jeff. Pointed out by: Kostik Belousov
222422	28-May-2011	mckusick	Due to a lag in updating the fs_pendinginodes count, we cannot depend on it to decide whether we should try to reclaim inodes when we run short. Discovered by: Peter Holm
222334	26-May-2011	mckusick	The check for whether a block is going to be claimed by a snapshot needs to happen before we notify the underlying layer that it is being freed.
222196	22-May-2011	rmacklem	Fix the ufs/ffs file system so that it uses the lock flags argument added to VFS_FHTOVP() by r222167. Reviewed by: mckusick
222167	22-May-2011	rmacklem	Add a lock flags argument to the VFS_FHTOVP() file system method, so that callers can indicate the minimum vnode locking requirement. This will allow some file systems to choose to return a LK_SHARED locked vnode when LK_SHARED is specified for the flags argument. This patch only adds the flag. It does not change any file system to use it and all callers specify LK_EXCLUSIVE, so file system semantics are not changed. Reviewed by: kib
221829	13-May-2011	mdf	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
221281	30-Apr-2011	kib	Fix typos. Noted by: Fabian Keil <freebsd-listen fabiankeil de> Pointy hat to: kib MFC after: 1 week
221261	30-Apr-2011	kib	Clarify the comment. MFC after: 1 week
220985	24-Apr-2011	kib	VFS sometimes is unable to inactivate a vnode when vnode use count goes to zero. E.g., the vnode might be only shared-locked at the time of vput() call. Such vnodes are kept in the hash, so they can be found later. If ffs_valloc() allocated an inode that has its vnode cached in hash, and still owing the inactivation, then vget() call from ffs_valloc() clears VI_OWEINACT, and then the vnode is reused for the newly allocated inode. The problem is, the vnode is not reclaimed before it is put to the new use. ffs_valloc() recycles vnode vm object, but this is not enough. In particular, at least v_vflag should be cleared, and several bits of UFS state need to be removed. It is very inconvenient to call vgone() at this point. Instead, move some parts of ufs_reclaim() into helper function ufs_prepare_reclaim(), and call the helper from VOP_RECLAIM and ffs_valloc(). Reviewed by: mckusick Tested by: pho MFC after: 3 weeks
220532	11-Apr-2011	jeff	- Refactor softdep_setup_freeblocks() into a set of functions to prepare for a new journal specific partial truncate routine. - Use dep_current[] in place of specific dependency counts. This is automatically maintained when workitems are allocated and has less risk of becoming incorrect.
220511	10-Apr-2011	jeff	Fix a long standing SUJ performance problem: - Keep a hash of indirect blocks that have recently been freed and are still referenced in the journal. - Lookup blocks in this hash before forcing a new block write to wait on the journal entry to hit the disk. This is only necessary to avoid confusion between old identities as indirects and new identities as file blocks. - Don't free jseg structures until the journal has written a record that invalidates it. This keeps the indirect block information around for as long as is required to be safe. - Force an empty journal block write when required to flush out stale journal data that is simply waiting for the oldest valid sequence number to advance beyond it.
220406	07-Apr-2011	jeff	- Don't invalidate jnewblks immediately upon discovering that the block will be removed. Permit the journal to proceed so that we don't leave a rollback in a cg for a very long time as this can cause terrible perf problems in low memory situations. Tested by: pho
220374	05-Apr-2011	mckusick	Be far more persistent in reclaiming blocks and inodes before giving up and declaring a filesystem out of space. Especially necessary when running on a small filesystem. With this improvement, it should be possible to use soft updates on a small root filesystem. Kudos to: Peter Holm Testing by: Peter Holm MFC: 2 weeks
220282	02-Apr-2011	jeff	Fix problems that manifested from filesystem full conditions: - In softdep_revert_mkdir() find the dotaddref before we attempt to cancel the jaddref so we can make assumptions about where the dotaddref is on the list. cancel_jaddref() does not always remove items from the list anymore. - Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE was never set. This ensures that dependencies will continue to be processed on the inowait/bufwait list and is more an artifact of the structure of the code than a pure ordering problem. - Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed appropriately. This normally occurs when the refs are added to the journal but if they are canceled before this point the state would never be set and the dependency could never be freed. Reported by: pho Tested by: pho
220099	28-Mar-2011	kib	Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case. Submitted by: Aleksandr Rybalko <ray dlink ua>
219895	23-Mar-2011	mckusick	Add retry code analogous to the block allocation retry code to avoid running out of inodes. Reported by: Peter Holm
219804	20-Mar-2011	kib	Retire opt_ffs_broken_fixme.h. Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with usual layering. Requested by: bde MFC after: 1 week
219712	17-Mar-2011	kib	Remove the #if defined(FFS) \|\| defined(IFS) braces around the calls to ffs_snapgone(). ufs.ko module is not build with FFS define, causing snapshot inode number slots in superblock never be freed, as well as a reference on the snapshot vnode. IFS was removed several years ago, and UFS/FFS separation was not maintained for real. Reported, analyzed and tested by: Yamagi Burmeister <lists yamagi org> MFC after: 3 days
219388	07-Mar-2011	kib	Simplify uses of the web of pointers. Reviewed by: mckusick MFC after: 1 week
219384	07-Mar-2011	jhb	The UFS dirhash code was attempting to update shared state in the dirhash from multiple threads while holding a shared lock during a lookup operation. This could result in incorrect ENOENT failures which could then be permanently stored in the name cache. Specifically, the dirhash code optimizes the case that a single thread is walking a directory sequentially opening (or stat'ing) each file. It uses state in the dirhash structure to determine if a given lookup is using the optimization. If the optimization fails, it disables it and restarts the lookup. The problem arises when two threads both attempt the optimization and fail. The first thread will restart the loop, but the second thread will incorrectly think that it did not try the optimization and will only examine a subset of the directory entires in its hash chain. As a result, it may fail to find its directory entry and incorrectly fail with ENOENT. To make this safe for use with shared locks, simplify the state stored in the dirhash and move some of the state (the part that determines if the current thread is trying the optimization) into a local variable. One result is that we will now try the optimization more often. We still update the value under the shared lock, but it is a single atomic store similar to i_diroff that is stored in UFS directory i-nodes for the non-dirhash lookup. Reviewed by: kib MFC after: 1 week
219276	04-Mar-2011	jhb	Use ffs() to locate free bits in the inode bitmap rather than a loop with bit shifts. Reviewed by: mckusick MFC after: 1 month
218838	19-Feb-2011	kib	v_mountedhere is a member of the union. Check that the vnodes have proper type before using the member. Reported and tested by: Michael Butler <imb protected-networks net>
218602	12-Feb-2011	kib	Use the native sector size of the device backing the UFS volume for SU+J journal blocks, instead of hard coding 512 byte sector size. Journal need to atomically write the block, that can only be guaranteed at the device sector size, not larger. Attempt to write less then sector size results in driver errors. Note that this is the first structure in UFS that depends on the sector size. Other elements are written in the units of fragments. In collaboration with: pho Reviewed by: jeff Tested by: bz, pho
218513	10-Feb-2011	netchild	Wrap long line. Noticed by: bz
218485	09-Feb-2011	netchild	Add some FEATURE macros for some UFS features. SU+J is not included as a FEATURE macro: - it was not in the tree during the GSoC - I do not see an option to en-/disable it in NOTES Two minor changes where made during the review compared to what was developed during GSoC 2010. No FreeBSD version bump, the userland application to query the features will be committed last and can serve as an indication of the availablility if needed. Sponsored by: Google Summer of Code 2010 Submitted by: kibab Reviewed by: kib X-MFC after: to be determined in last commit with code from this project
218424	08-Feb-2011	mdf	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yeild() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
218195	02-Feb-2011	mdf	Put the general logic for being a CPU hog into a new function should_yield(). Use this in various places. Encapsulate the common case of check-and-yield into a new function maybe_yield(). Change several checks for a magic number of iterations to use should_yield() instead. MFC after: 1 week
217357	13-Jan-2011	pluknet	Embed a quota error message (C string) into uprintf() fmt. While here, fix whitespaces. Approved by: kib (mentor)
217326	12-Jan-2011	mdf	sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly. Commit the kernel changes.
216951	04-Jan-2011	kib	Instead of incrementing freework reference counter in indir_trunc(), do it at the allocation time for journaled fs and indirect blocks, when the allocated object is not accessible outside. Requested and reviewed by: jeff Tested by: pho
216818	30-Dec-2010	kib	Handle missing jremrefs when a directory is renamed overtop of another, deleting it. If the directory is removed, UFS always need to remove the .. ref, even if the ultimate ref on the parent would not change. The new directory must have a new journal entry for that ref. Otherwise journal processing would not properly account for the parent's reference since it will belong to a removed directory entry. Change ufs_rename()'s dotdot rename section to always setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency allocated via setup_dotdot_link(). Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove(). Remove the isdirrem > 1 checks from newdirrem(). Reported by: many Submitted by: jeff Tested by: pho
216817	30-Dec-2010	kib	In indir_trunc(), when processing jnewblk entries that are not written to the disk, recurse to handle indirect blocks of next level that are hidden by the corresponding entry. In collaboration with: pho Reviewed by: jeff, mckusick Tested by: mckusick, pho
216796	29-Dec-2010	kib	Add kernel side support for BIO_DELETE/TRIM on UFS. The FS_TRIM fs flag indicates that administrator requested issuing of TRIM commands for the volume. UFS will only send the command to disk if the disk reports GEOM::candelete attribute. Since disk queue is reordered, data block is marked as free in the bitmap only after TRIM command completed. Due to need to sleep waiting for i/o to finish, TRIM bio_done routine schedules taskqueue to set the bitmap bit. Based on the patch by: mckusick Reviewed by: mckusick, pjd Tested by: pho MFC after: 1 month
216795	29-Dec-2010	kib	Move the definition of mkdirlisthd from header to C file. Reviewed by: mckusick Tested by: pho
216792	29-Dec-2010	kib	Use a proper type for the variable holding the summary size of the inode data. Otherwise, on 32bit systems, unlinked inode which size is the multiple of 4GB was not truncated, causing corruption. Reported by: brucec Reviewed by: mckusick Tested by: pho
216676	23-Dec-2010	mckusick	This patch fixes a soft update panic while running perl 5.12 tests which produced: panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164 Reported by: Dimitry Andric Fixed by: Jeff Roberson
216099	01-Dec-2010	kib	Journal start looks up .sujournal file by doing lookup on the root dvp. As result, failed softdep_mount() might leave up to two vnodes on the mp mountlist, preventing mnt_ref from going to zero. Call ffs_flushfiles() after failed softdep_mount() to clean mountlist. Initial report by: Garrett Cooper Reproduced and tested by: pho
215950	27-Nov-2010	pho	First step in fixing the handle_workitem_freeblocks panic. In collaboration with: kib
215576	20-Nov-2010	mckusick	Delete /sys/ufs/ffs/README.snapshot as it is no longer relevant. Drop reference to it in mount(8). MFC: 3 days
215548	19-Nov-2010	kib	Remove prtactive variable and related printf()s in the vop_inactive and vop_reclaim() methods. They seems to be unused, and the reported situation is normal for the forced unmount. MFC after: 1 week X-MFC-note: keep prtactive symbol in vfs_subr.c
215117	11-Nov-2010	kib	The softdep_setup_freeblocks() adds worklist items before deallocate_dependencies() is done. This opens a race between softdep thread and the thread that does the truncation: A write of the indirect block causes the freeblks to become ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And then, softdep_disk_write_complete() would reassign the workitem to the mount point worklist, causing premature processing of the workitem, or journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does the same reassign. indir_trunc() then would find the indirect block that is locked (with lock owned by kernel) but without any dependencies, causing it to hang in getblk() waiting for buffer lock. Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies() finished. Analyzed, suggested and reviewed by: jeff Tested by: pho
215115	11-Nov-2010	kib	Change #ifdef INVARIANTS panic into KASSERT, and print some useful information to diagnose the issue, in handle_complete_freeblocks(). Reviewed by: jeff Tested by: pho
215114	11-Nov-2010	kib	In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped. I believe there is a window otherwise where jblocks can be accessed without proper initialization. Reviewed by: jeff Tested by: pho
215113	11-Nov-2010	kib	Add function lbn_offset to calculate offset of the indirect block of given level. Reviewed by: jeff Tested by: pho
215112	11-Nov-2010	kib	Fix typo. Function is called ffs_blkfree.
215052	09-Nov-2010	jhb	Remove unused includes of <sys/mutex.h> and <machine/mutex.h>.
214359	25-Oct-2010	ivoras	Bring vfs.ufs.dirhash_maxmem into the age of the fruitbat and make it autotuned. It is only an upper bound (the memory is not always allocated) and the system contains a vm_lowmem handler so nothing will crash and burn if it's tuned too high. Reviewed by: mckusick
213664	10-Oct-2010	kib	The r184588 changed the layout of struct export_args, causing an ABI breakage for old mount(2) syscall, since most struct <filesystem>_args embed export_args. The mount(2) is supposed to provide ABI compatibility for pre-nmount mount(8) binaries, so restore ABI to pre-r184588. Requested and reviewed by: bde MFC after: 2 weeks
213363	02-Oct-2010	alc	M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that have no run-time effect.
213275	29-Sep-2010	mckusick	Since local variable 'i' is used only in a KASSERT, declare and initialize it only if INVARIANTS is defined to avoid a declared but unused warning. Suggested by: Brian Somers <brian@FreeBSD.org>
213259	29-Sep-2010	kib	Fix typo in comment.
212788	17-Sep-2010	obrien	Correct some non-code typos.
212617	14-Sep-2010	mckusick	Update comments in soft updates code to more fully describe the addition of journalling. Only functional change is to tighten a KASSERT. Reviewed by: jeff Roberson
211531	20-Aug-2010	jhb	Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and LK_CANRECURSE after a lock is created. Use them to implement macros that otherwise manipulated the flags directly. Assert that the associated lockmgr lock is exclusively locked by the current thread when manipulating these flags to ensure the flag updates are safe. This last change required some minor shuffling in a few filesystems to exclusively lock a brand new vnode slightly earlier. Reviewed by: kib MFC after: 3 days
211212	12-Aug-2010	kib	Softdep_process_worklist() should unsuspend not only before processing the worklist (in softdep_process_journal), but also after flushing the workitems. Might be, we should even do this before bwillwrite() too, but this seems to be not needed for now. Fs might be suspended during processing the queue, and then there is nobody around to unsuspend. In collaboration with: pho Tested by: bz Reviewed by: jeff
210172	16-Jul-2010	jhb	Revert the previous commit. The race is not applicable to the lockmgr implementation in 8.0 and later as its flags field does not hold dynamic state such as waiters flags, but is only modified in lockinit() aside from VN_LOCK_*(). Discussed with: attilio
210171	16-Jul-2010	jhb	When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE in the vnode lock's flags) until after they had determined if the vnode was a FIFO. This occurs after the vnode has been inserted a VFS hash or some similar table, so it is possible for another thread to find this vnode via vget() on an i-node number and block on the vnode lock. If the lockmgr interlock (vnode interlock for vnode locks) is not held when clearing the LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result the thread blocked on the vnode lock may never get woken up. Fix this by holding the vnode interlock while modifying the lock flags in this case. MFC after: 3 days
209717	06-Jul-2010	jeff	- Handle the truncation of an inode with an effective link count of 0 in the context of the process that reduced the effective count. Previously all truncation as a result of unlink happened in the softdep flush thread. This had the effect of being impossible to rate limit properly with the journal code. Now the process issuing unlinks is suspended when the journal files. This has a side-effect of improving rm performance by allowing more concurrent work. - Handle two cases in inactive, one for effnlink == 0 and another when nlink finally reaches 0. - Eliminate the SPACECOUNTED related code since the truncation is no longer delayed. Discussed with: mckusick
209367	20-Jun-2010	kib	Ensure that VOP_ACCESSX is called with exclusively locked vnode for the kernel compiled with QUOTA option. ufs_accessx() upgrades the vdp vnode lock from shared to exclusive to assign the dquot structure to the vnode, and ufs_delete_denied() is called when tvp is locked. Since upgrade drops shared lock when non-blocked upgrade failed, LOR is there. Reported and tested by: Dmitry Pryanishnikov <lynx.ripe gmail com> Tested by: pho PR: kern/147890 MFC after: 1 week
209057	11-Jun-2010	avg	ffs_softdep: change K&R in function defintions to ANSI prototypes Apparently it's bad when we first have an ANSI prototype in function declaration, but then use K&R in its defintion. Complaint from: clang MFC after: 2 weeks
208774	03-Jun-2010	kib	Extend the scope of the lock on the quota file vnode in quotaon() to cover the initial read by dqopen(). Assert that vnode is locked in dqopen(). Remove VFS_LOCK_GIANT() from dqopen(), since quotaon() keeps Giant locked if needed around the call.
208293	19-May-2010	avg	ffs_mount: accept and drop userland-only options that can be passed from loader(8) In r193192 loader(8) has grown an ability to pass root mount options from fstab via vfs.root.mountfrom.options. Unfortunately, some options that can be present in fstab are for userland only and lead to root mounting failure when seen by kernel. Rather than teaching loader about FFS-specific options that should be filtered out, ffs_mount recognizes those options as valid, but ignores and deletes[1] them. [1] is suggested by jh. PR: kern/141050 Reported by: many Reviewed by: jh, bde MFC after: 4 days
208287	19-May-2010	jeff	- Don't immediately re-run softdepflush if we didn't make any progress on the last iteration. This can lead to a deadlock when we have worklist items that cannot be immediately satisfied. Reported by: uqs, Dimitry Andric <dimitry@andric.com> - Remove some unnecessary debugging code and place some other under SUJ_DEBUG. - Examine the journal state in softdep_slowdown(). - Re-format some comments so I may more easily add flag descriptions.
207742	07-May-2010	jeff	- Call softdep_prealloc() before any of the balloc routines in the snapshot code. - Don't fsync() vnodes in prealloc if copy on write is in progress. It is not safe to recurse back into the write path here. Reported by: Vladimir Grebenschikov <vova@fbsd.ru>
207741	07-May-2010	jeff	- Use the correct flag mask when determining whether an inode has successfully made it to the free list yet or not. This fixes a deadlock that can occur with unlinked but referenced files. Journal space and inodedeps were not correctly reclaimed because the inode block was not left dirty. Tested/Reported by: lwindschuh@googlemail.com
207736	07-May-2010	mckusick	Merger of the quota64 project into head. This joint work of Dag-Erling Smørgrav and myself updates the FFS quota system to support both traditional 32-bit and new 64-bit quotas (for those of you who want to put 2+Tb quotas on your users). By default quotas are not compiled into the kernel. To include them in your kernel configuration you need to specify: options QUOTA # Enable FFS quotas If you are already running with the current 32-bit quotas, they should continue to work just as they have in the past. If you wish to convert to using 64-bit quotas, use `quotacheck -c 64'; if you wish to revert from 64-bit quotas back to 32-bit quotas, use `quotacheck -c 32'. There is a new library of functions to simplify the use of the quota system, do `man quotafile' for details. If your application is currently using the quotactl(2), it is highly recommended that you convert your application to use the quotafile interface. Note that existing binaries will continue to work. Special thanks to John Kozubik of rsync.net for getting me interested in pursuing 64-bit quota support and for funding part of my development time on this project.
207728	06-May-2010	alc	Eliminate page queues locking around most calls to vm_page_free().
207669	05-May-2010	alc	Acquire the page lock around all remaining calls to vm_page_free() on managed pages that didn't already have that lock held. (Freeing an unmanaged page, such as the various pmaps use, doesn't require the page lock.) This allows a change in vm_page_remove()'s locking requirements. It now expects the page lock to be held instead of the page queues lock. Consequently, the page queues lock is no longer required at all by callers to vm_page_rename(). Discussed with: kib
207662	05-May-2010	trasz	Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize(). Reviewed by: kib
207366	29-Apr-2010	avg	ffs_vfsops: restore alphabetic order of options in ffs_opts The order was not correct only for nfsv4acls. ("no" prefix is ignored) MFC after: 1 week
207310	28-Apr-2010	jeff	- When canceling jaddrefs they may not yet be in the journal if this is via a revert call. In this case don't attempt to remove something that has not yet been added. Otherwise this jaddref must hang around to prevent the bitmap write as normal.
207309	28-Apr-2010	jeff	- Fix builds without SOFTUPDATES defined in the kernel config.
207142	24-Apr-2010	pjd	Fix build for UFS without SOFTUPDATES.
207141	24-Apr-2010	jeff	- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown. Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm
206894	20-Apr-2010	kib	The cache_enter(9) function shall not be called for doomed dvp. Assert this. In the reported panic, vdestroy() fired the assertion "vp has namecache for ..", because pseudofs may end up doing cache_enter() with reclaimed dvp, after dotdot lookup temporary unlocked dvp. Similar problem exists in ufs_lookup() for "." lookup, when vnode lock needs to be upgraded. Verify that dvp is not reclaimed before calling cache_enter(). Reported and tested by: pho Reviewed by: kan MFC after: 2 weeks
206128	03-Apr-2010	avg	ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj The assignment is already done in g_vfs_open. Redundant assignment is harmless, but can become a problem if g_vfs_open logic is changed. MFC after: 1 week
203818	13-Feb-2010	kib	When ffs_realloccg() failed to allocate bigger fragment and, because pending blocks are scheduled for removal, goes to retry the (re)allocation, clear the bp pointer. It might happen that meantime free space is really exhausted and we are entering nospace: label without bread()ing buffer, causing stale bp value to be brelse()d again. Tested by: pho (Producing a scenario to reliably reproduce the race appeared to be much harder then fixing the bug) MFC after: 1 week
203784	11-Feb-2010	mckusick	One last pass to get all the unsigned comparisons correct.
203763	10-Feb-2010	mckusick	This fix corrects a problem in the file system that treats large inode numbers as negative rather than unsigned. For a default (16K block) file system, this bug began to show up at a file system size above about 16Tb. To fully handle this problem, newfs must be updated to ensure that it will never create a filesystem with more than 2^32 inodes. That patch will be forthcoming soon. Reported by: Scott Burns, John Kilburg, Bruce Evans Followup by: Jeff Roberson PR: 133980 MFC after: 2 weeks
203761	10-Feb-2010	trasz	Remove unused variable.
202971	25-Jan-2010	trasz	Return proper error code. Found with: clang
202934	24-Jan-2010	trasz	Move out code that does POSIX.1e ACL inheritance into separate routines. Reviewed by: rwatson
202125	11-Jan-2010	mckusick	Cast 64-bit quantity to intptr_t rather than int so as to work properly with 64-bit architectures (such as amd64). Reported by: bz
202113	11-Jan-2010	mckusick	Background: When renaming a directory it passes through several intermediate states. First its new name will be created causing it to have two names (from possibly different parents). Next, if it has different parents, its value of ".." will be changed from pointing to the old parent to pointing to the new parent. Concurrently, its old name will be removed bringing it back into a consistent state. When fsck encounters an extra name for a directory, it offers to remove the "extraneous hard link"; when it finds that the names have been changed but the update to ".." has not happened, it offers to rewrite ".." to point at the correct parent. Both of these changes were considered unexpected so would cause fsck in preen mode or fsck in background mode to fail with the need to run fsck manually to fix these problems. Fsck running in preen mode or background mode now corrects these expected inconsistencies that arise during directory rename. The functionality added with this update is used by fsck running in background mode to make these fixes. Solution: This update adds three new fsck sysctl commands to support background fsck in correcting expected inconsistencies that arise from incomplete directory rename operations. They are: setcwd(dirinode) - set the current directory to dirinode in the filesystem associated with the snapshot. setdotdot(oldvalue, newvalue) - Verify that the inode number for ".." in the current directory is oldvalue then change it to newvalue. unlink(nameptr, oldvalue) - Verify that the inode number associated with nameptr in the current directory is oldvalue then unlink it. As with all other fsck sysctls, these new ones may only be used by processes with appropriate priviledge. Reported by: jeff Security issues: rwatson
201758	07-Jan-2010	mbr	Remove extraneous semicolons, no functional changes. Submitted by: Marc Balmer <marc@msys.ch> MFC after: 1 week
201717	07-Jan-2010	mckusick	KASSERT that condition raised by Coverity cannot happen. Found by: Coverity Prevent (tm) KASSERT by: sam
200796	21-Dec-2009	trasz	Implement NFSv4 ACL support for UFS. Reviewed by: rwatson
200770	21-Dec-2009	kib	VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object flag. Besides providing the redundand information, need to update both vnode and object flags causes more acquisition of vnode interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects. Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects. Suggested and reviewed by: alc Tested by: pho MFC after: 3 weeks
197408	22-Sep-2009	rdivacky	Don't build ufs_gjournal.c at all if UFS_GJOURNAL option is not given instead of building an almost empty C file. Approved by: pjd Approved by: ed (mentor, implicit)
197269	17-Sep-2009	brooks	Allocate space for the group array in a static credential used in the quota code. One case was correctly handled in r194498, but this one was missed. PR: kern/138657 Tested by: PR submitter MFC after: 3 days
196987	08-Sep-2009	trasz	Remove useless variable assignment.
196920	07-Sep-2009	kib	insmntque_stddtr() clears vp->v_data and resets vp->v_op to dead_vnodeops before calling vgone(). Revert r189706 and corresponding part of the r186560. Noted and reviewed by: tegge Approved by: des (pseudofs part) MFC after: 3 days
196888	06-Sep-2009	kib	The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp, V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific mp data, causing ffs_vgetf() to access freed memory. Busy mountpoint before dropping softdep lk. Noted and reviewed by: tegge Tested by: pho MFC after: 1 week
196206	14-Aug-2009	kib	When a UFS node is truncated to the zero length, e.g. by explicit truncate(2) call, or by being removed or truncated on open, either new softupdate freeblks structure is allocated to track the freed blocks of the node, or truncation is done syncronously when too many SU dependencies are accumulated. The decision does not take into account the allocated freeblks dependencies, allowing workloads that do huge amount of truncations to exhaust the kernel memory. Take the number of allocated freeblks into consideration for softdep_slowdown(). Reported by: pluknet gmail com Diagnosed and tested by: pho Approved by: re (rwatson) MFC after: 1 month
195296	02-Jul-2009	trasz	Fix fpathconf(3) on fifos, in effect making ls(1) properly display '+' on them. Taken from kern/125613, with cosmetic changes. PR: kern/125613 Submitted by: Jaakko Heinonen <jh at saunalahti dot fi> Approved by: re (kib)
195294	02-Jul-2009	kib	In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point around the sequence that drop vnode lock and then busies the mount point. Not having vlocked node or direct reference to the mp allows for the forced unmount to proceed, making mp unmounted or reused. Tested by: pho Reviewed by: jeff Approved by: re (kensmith) MFC after: 2 weeks
195265	01-Jul-2009	trasz	Don't panic on attempt to set ACL on a block device file. This is just a part of kern/125613. PR: kern/125613 Submitted by: Jaakko Heinonen <jh at saunalahti dot fi> Reviewed by: rwatson Approved by: re (kib)
195187	30-Jun-2009	kib	For SU mounts, softdep_fsync() might drop vnode lock, allowing other threads to put dirty buffers on the vnode bufobj list. For regular files and synchronous fsync requests, check for the condition and restart the fsync vop if a new dirty buffer arrived. Tested by: pho Approved by: re (kensmith) MFC after: 1 month
195186	30-Jun-2009	kib	Softdep_fsync() may need to lock parent directory of the synced vnode. Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent mountpoint from being unmounted and freed while no vnodes are locked. Tested by: pho Approved by: re (kensmith) MFC after: 1 month
195003	25-Jun-2009	snb	Fix a bug reported by pho@ where one can induce a panic by decreasing vfs.ufs.dirhash_maxmem below the current amount of memory used by dirhash. When ufsdirhash_build() is called with the memory in use greater than dirhash_maxmem, it attempts to free up memory by calling ufsdirhash_recycle(). If successful in freeing enough memory, ufsdirhash_recycle() leaves the dirhash list locked. But at this point in ufsdirhash_build(), the list is not explicitly unlocked after the call(s) to ufsdirhash_recycle(). When we next attempt to lock the dirhash list, we will get a "panic: _mtx_lock_sleep: recursed on non-recursive mutex dirhash list". Tested by: pho Approved by: dwmalone (mentor) MFC after: 3 weeks
194498	19-Jun-2009	brooks	Rework the credential code to support larger values of NGROUPS and NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024 and 1023 respectively. (Previously they were equal, but under a close reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it is the number of supplemental groups, not total number of groups.) The bulk of the change consists of converting the struct ucred member cr_groups from a static array to a pointer. Do the equivalent in kinfo_proc. Introduce new interfaces crcopysafe() and crsetgroups() for duplicating a process credential before modifying it and for setting group lists respectively. Both interfaces take care for the details of allocating groups array. crsetgroups() takes care of truncating the group list to the current maximum (NGROUPS) if necessary. In the future, crsetgroups() may be responsible for insuring invariants such as sorting the supplemental groups to allow groupmember() to be implemented as a binary search. Because we can not change struct xucred without breaking application ABIs, we leave it alone and introduce a new XU_NGROUPS value which is always 16 and is to be used or NGRPS as appropriate for things such as NFS which need to use no more than 16 groups. When feasible, truncate the group list rather than generating an error. Minor changes: - Reduce the number of hand rolled versions of groupmember(). - Do not assign to both cr_gid and cr_groups[0]. - Modify ipfw to cache ucreds instead of part of their contents since they are immutable once referenced by more than one entity. Submitted by: Isilon Systems (initial implementation) X-MFC after: never PR: bin/113398 kern/133867
194387	17-Jun-2009	snb	Keep dirhash tailq locked throughout the entirety of ufsdirhash_destroy() to fix a potential race pointed out by pjd. Also use TAILQ_FOREACH_SAFE to iterate over dirhashes in ufsdirhash_lowmem(), so that we can continue iterating even after a dirhash is destroyed. Suggested by: pjd Tested by: pho Approved by: dwmalone (mentor)
194296	16-Jun-2009	kib	Do not use casts (int )0 and (struct thread )0 for the arguments of vn_rdwr, use NULL. Reviewed by: jhb MFC after: 1 week
193511	05-Jun-2009	rwatson	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
193375	03-Jun-2009	snb	Add vm_lowmem event handler for dirhash. This will cause dirhashes to be deleted when the system is low on memory. This ought to allow an increase to vfs.ufs.dirhash_maxmem on machines that have lots of memory, without degrading performance by having too much memory reserved for dirhash when other things need it. The default value for dirhash_maxmem is being kept at 2MB for now, though. This work was mostly done during the 2008 Google Summer of Code. Approved by: dwmalone (mentor), re MFC after: 3 months
193307	02-Jun-2009	attilio	Handle lock recursion differenty by always checking against LO_RECURSABLE instead the lock own flag itself. Tested by: pho
192895	27-May-2009	jamie	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
192586	22-May-2009	trasz	Make 'struct acl' larger, as required to support NFSv4 ACLs. Provide compatibility interfaces in both kernel and libc. Reviewed by: rwatson
192260	17-May-2009	alc	Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg(). In collaboration with: tegge
191990	11-May-2009	attilio	Remove the thread argument from the FSD (File-System Dependent) parts of the VFS. Now all the VFS_* functions and relating parts don't want the context as long as it always refers to curthread. In some points, in particular when dealing with VOPs and functions living in the same namespace (eg. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP. While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option. VFS KPI is heavilly changed by this commit so thirdy parts modules needs to be recompiled. Bump __FreeBSD_version in order to signal such situation.
191940	09-May-2009	kan	Do not embed struct ucred into larger netcred parent structures. Credential might need to hang around longer than its parent and be used outside of mnt_explock scope controlling netcred lifetime. Use separate reference-counted ucred allocated separately instead. While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up some unused declarations in new NFS code. Reported by: John Hickey PR: kern/133439 Reviewed by: dfr, kib
191564	27-Apr-2009	rmacklem	Change the semantics of i_modrev/va_filerev to what is required for the nfsv4 Change attribute. There are 2 changes: 1 - The value now changes on metadata changes as well as data modifications (incremented for IN_CHANGE instead of IN_UPDATE). 2 - It is now saved in spare space in the on-disk i-node so that it survives a crash. Since va_filerev is not passed out into user space, the only current use of va_filerev is in the nfs server, which uses it as the directory cookie verifier. Since this verifier is only passed back to the server by a client verbatim and then the server doesn't check it, changing the semantics should not break anything currently in FreeBSD. Reviewed by: bde Approved by: kib (mentor)
191315	20-Apr-2009	kib	In ufs_checkpath(), recheck that '..' still points to the inode with the same inode number after VFS_VGET() and relock of the vp. If '..' changed, redo the lookup. To reduce code duplication, move the code to read '..' dirent into the static helper function ufs_dir_dd_ino(). Supply the source inode number as an argument to ufs_checkpath() instead of the source inode itself. The inode is unlocked, thus it might be reclaimed, causing accesses to the freed memory. Use vn_vget_ino() to get the '..' vnode by its inode number, instead of directly code VFS_VGET() and relock, to properly busy the mount point while vp lock is dropped. Noted and reviewed by: tegge Tested by: pho MFC after: 1 month
191260	19-Apr-2009	kib	When verifying '..' after VFS_VGET() in ufs_lookup(), do not return error if '..' is still there but changed between lookup and check. Start relookup instead. Rename is supposed to change '..' reference atomically, so transient failures introduced by r191137 are wrong. While rearranging the code to allow lookup restart in ufs_lookup(), remove the comment that only distracts the reader. Noted and reviewed by: tegge Also reported by: pho MFC after: 1 month
191249	18-Apr-2009	trasz	Use acl_alloc() and acl_free() instead of using uma(9) directly. This will make switching to malloc(9) easier; also, it would be neccessary to add these routines if/when we implement variable-size ACLs.
191137	16-Apr-2009	kib	Verify that '..' still exists with the same inode number after VFS_VGET() has returned in ufs_lookup(). If the '..' lookup started immediately before the parent directory was removed, we might return either cleared or unrelated inode otherwise. Ufs_lookup() is split into new function ufs_lookup_() that either does lookup, or verifies that directory entry exists and references supplied inode number. Reviewed by: tegge Tested by: pho, Andreas Tobler <andreast-list fgznet ch> (previous version) MFC after: 1 month
190888	10-Apr-2009	rwatson	Remove VOP_LEASE and supporting functions. This hasn't been used since the removal of NQNFS, but was left in in case it was required for NFSv4. Since our new NFSv4 client and server can't use it for their requirements, GC the old mechanism, as well as other unused lease- related code and interfaces. Due to its impact on kernel programming and binary interfaces, this change should not be MFC'd. Proposed by: jeff Reviewed by: jeff Discussed with: rmacklem, zach loafman @ isilon
190690	04-Apr-2009	kib	When removing or renaming snaphost, do not delve into request_cleanup(). The later may need blocks from the underlying device that belongs to normal files, that should not be locked while snap lock is held. Reported and tested by: pho MFC after: 1 month
190469	27-Mar-2009	kib	Correct typo. Noted by: kensmith
189878	16-Mar-2009	kib	Fix two issues with bufdaemon, often causing the processes to hang in the "nbufkv" sleep. First, ffs background cg group block write requests a new buffer for the shadow copy. When ffs_bufwrite() is called from the bufdaemon due to buffers shortage, requesting the buffer deadlock bufdaemon. Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk to not block while allocating the buffer, and return failure instead. Add a flag argument to the geteblk to allow to pass the flags to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer allocation failed and either GB_NOWAIT_BD is specified, or geteblk() is called from bufdaemon (or its helper, see below). In ffs_bufwrite(), fall back to synchronous cg block write if shadow block allocation failed. Since r107847, buffer write assumes that vnode owning the buffer is locked. The second problem is that buffer cache may accumulate many buffers belonging to limited number of vnodes. With such workload, quite often threads that own the mentioned vnodes locks are trying to read another block from the vnodes, and, due to buffer cache exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make any substantial progress because the vnodes are locked. Allow the threads owning vnode locks to help the bufdaemon by doing the flush pass over the buffer cache before getnewbuf() is going to uninterruptible sleep. Move the flushing code from buf_daemon() to new helper function buf_do_flush(), that is called from getnewbuf(). The number of buffers flushed by single call to buf_do_flush() from getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent recursive calls to buf_do_flush() by marking the bufdaemon and threads that temporarily help bufdaemon by TDP_BUFNEED flag. In collaboration with: pho Reviewed by: tegge (previous version) Tested by: glebius, yandex ... MFC after: 3 weeks
189737	12-Mar-2009	kib	The non-modifying EA VOPs are executed with only shared vnode lock taken. Provide a custom lock around initializing and tearing down EA area, to prevent both memory leaks and double-free of it. Count the number of EA area accessors. Lock protocol requires either holding exclusive vnode lock to modify i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag. Noted by: YAMAMOTO, Taku <taku tackymt homeip net> Tested by: pho (previous version) MFC after: 2 weeks
189706	11-Mar-2009	kib	Do not double-free the struct inode when insmntque failed. Default insmntque destructor reclaims the vnode, and ufs_reclaim frees the memory. Reviewed by: tegge MFC after: 3 days
189696	11-Mar-2009	jhb	Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a filesystem supports additional operations using shared vnode locks. Currently this is used to enable shared locks for open() and close() of read-only file descriptors. - When an ISOPEN namei() request is performed with LOCKSHARED, use a shared vnode lock for the leaf vnode only if the mount point has the extended shared flag set. - Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but not O_CREAT. - Use a shared vnode lock around VOP_CLOSE() if the file was opened with O_RDONLY and the mountpoint has the extended shared flag set. - Adjust md(4) to upgrade the vnode lock on the vnode it gets back from vn_open() since it now may only have a shared vnode lock. - Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since FIFO's require exclusive vnode locks for their open() and close() routines. (My recent MPSAFE patches for UDF and cd9660 already included this change.) - Enable extended shared operations on UFS, cd9660, and UDF. Submitted by: ups Reviewed by: pjd (ZFS bits) MFC after: 1 month
189595	09-Mar-2009	jhb	Adjust some variables (mostly related to the buffer cache) that hold address space sizes to be longs instead of ints. Specifically, the follow values are now longs: runningbufspace, bufspace, maxbufspace, bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace, hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a relatively small number (~ 44000) of buffers set in kern.nbuf would result in integer overflows resulting either in hangs or bogus values of hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see such problems. There was a check for a nbuf setting that would cause overflows in the auto-tuning of nbuf. I've changed it to always check and cap nbuf but warn if a user-supplied tunable would cause overflow. Note that this changes the ABI of several sysctls that are used by things like top(1), etc., so any MFC would probably require a some gross shims to allow for that. MFC after: 1 month
188956	23-Feb-2009	trasz	Right now, when trying to unmount a device that's already gone, msdosfs_unmount() and ffs_unmount() exit early after getting ENXIO. However, dounmount() treats ENXIO as a success and proceeds with unmounting. In effect, the filesystem gets unmounted without closing GEOM provider etc. Reviewed by: kib Approved by: rwatson (mentor) Tested by: dho Sponsored by: FreeBSD Foundation
188954	23-Feb-2009	trasz	Refactor, moving error checking outside of the 'if (mp->mnt_flag & MNT_SOFTDEP)' conditional. No functional changes. Reviewed by: kib Approved by: rwatson (mentor) Tested by: pho Sponsored by: FreeBSD Foundation
188501	11-Feb-2009	jhb	- If the g_access() call for the initial root mount fails, then fully cleanup. Before the GEOM consumer would not have been closed. - Bump the reference on the character device being mounted while the associated devfs vnode is locked. Reviewed by: kib
188240	06-Feb-2009	trasz	When a device containing mounted UFS filesystem disappears, the type of devvp becomes VBAD, which UFS incorrectly interprets as snapshot vnode, which in turns causes panic. Fix it by replacing '!= VCHR' with '== VREG'. With this fix in place, you should no longer be able to panic the system by removing a device with an UFS filesystem mounted from it - assuming you don't use softupdates. Reviewed by: kib Tested by: pho Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation
187894	29-Jan-2009	trasz	Make sure the cdev doesn't go away while the filesystem is still mounted. Otherwise dev2udev() could return garbage. Reviewed by: kib Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation
187790	27-Jan-2009	rwatson	Following a fair amount of real world experience with ACLs and extended attributes since FreeBSD 5, make the following semantic changes: - Don't update the inode modification time (mtime) when extended attributes (and hence also ACLs) are added, modified, or removed. - Don't update the inode access tie (atime) when extended attributes (and hence also ACLs) are queried. This means that rsync (and related tools) won't improperly think that the data in the file has changed when only the ACL has changed. Note that ffs_reallocblks() has not been changed to not update on an IO_EXT transaction, but currently EAs don't use the cluster write routines so this shouldn't be a problem. If EAs grow support for clustering, then VOP_REALLOCBLKS() will need to grow a flag argument to carry down IO_EXT to UFS. MFC after: 1 week PR: ports/125739 Reported by: Alexander Zagrebin <alexz@visp.ru> Tested by: pluknet <pluknet@gmail.com>, Greg Byshenk <freebsd@byshenk.net> Discussed with: kib, kientzle, timur, Alexander Bokovoy <ab@samba.org>
187564	21-Jan-2009	jhb	Fix a few style bogons. Submitted by: bde
187528	21-Jan-2009	kib	Move the code from ufs_lookup.c used to do dotdot lookup, into the helper function. It is supposed to be useful for any filesystem that has to unlock dvp to walk to the ".." entry in lookup routine. Requested by: jhb Tested by: pho MFC after: 1 month
187526	21-Jan-2009	jhb	Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP: VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME can be performed while holding a shared vnode lock (the same functionality is done internally by VOP_READ which can run with a shared vnode lock). Add missing locking of the vnode interlock to the ufs implementation and remove a special note and test from the NFS client about not supporting the feature. Inspired by: ups Tested by: pho
187490	20-Jan-2009	kib	The r187467 should remove all pages for V_NORMAL case too, because indirect block pages are not removed by the mentioned invocation of the vnode_pager_setsize(). Put a common code into the helper function ffs_pages_remove(). Reported and tested by: dchagin Reviewed by: ups MFC after: 3 weeks
187474	20-Jan-2009	jhb	Add a comment explaining why the "bufwait" / "dirhash" LOR reported by WITNESS will not actually result in a deadlock. Discussed with: kib MFC after: 1 week
187468	20-Jan-2009	kib	When extending inode size, we call vnode_pager_setsize(), to have a address space where to put vnode pages, and then call UFS_BALLOC(), to actually allocate new block and map it. When UFS_BALLOC() returns error, sometimes we forget to revert the vm object size increase, allowing for the pages that are not backed by the logical disk blocks. Revert vnode_pager_setsize() back when UFS_BALLOC() failed, for ffs_truncate() and ffs_write(). PR: 129956 Reviewed by: ups MFC after: 3 weeks
187467	20-Jan-2009	kib	FFS puts the extended attributes blocks at the negative blocks for the vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it incorrectly does vm_object_page_remove(0, 0), removing all pages from the underlying vm object, not only the pages that back the extended attributes data. Change vinvalbuf() to not remove any pages from the object when V_NORMAL or V_ALT are specified. Instead, the only in-tree caller in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitely removes the corresponding page range. The V_NORMAL caller does vnode_pager_setsize(vp, 0) immediately after the call to vinvalbuf(V_NORMAL) already. Reported by: csjp Reviewed by: ups MFC after: 3 weeks
186898	08-Jan-2009	kib	Lock the uepm_lock around the autostart of extattrs. Reported and tested by: pho Reviewed by: rwatson MFC after: 3 weeks
186897	08-Jan-2009	kib	If unmount of the ffs mp failed, reinitialize the extended attributes for the mp, and restart them if autostart is enabled. Reported and tested by: pho Reviewed by: rwatson MFC after: 3 weeks
186278	18-Dec-2008	kib	Do not busy twice the mount point where a quota operation is performed. Tested by: pho MFC after: 1 month
186194	16-Dec-2008	trasz	According to phk@, VOP_STRATEGY should never, _ever_, return anything other than 0. Make it so. This fixes "panic: VOP_STRATEGY failed bp=0xc320dd90 vp=0xc3b9f648", encountered when writing to an orphaned filesystem. Reason for the panic was the following assert: KASSERT(i == 0, ("VOP_STRATEGY failed bp=%p vp=%p", bp, bp->b_vp)); at vfs_bio:bufstrategy(). Reviewed by: scottl, phk Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation
185761	08-Dec-2008	kib	The dqrele() function syncs the dq, then acquires the dqh lock, and then does final drop of the the dq reference to put it onto the free list. There is a possibility that the dq would be found by another thread after sync and before the dqh lock is acquired. If that other thread drops the dq before we have taken the dqh lock, the dirty dq is put on the free list. Recheck the DQ_MOD after the dqh lock is relocked. Repeat dqsync() if the dq is dirty. This ensures that up to date dq is written in the quota file and fixes assertion in dqget(). Reported and tested by: Frode Nordahl <frode nordahl net> MFC after: 3 days
185739	07-Dec-2008	kib	Improve usefulness of the panic by printing the pointer to the problematic dquot. In-tree gdb is often unable to get the dq value, so supply it in panic message. MFC after: 3 days
185556	02-Dec-2008	kib	Do not lock vnode interlock around reading of v_iflag to check VI_DOOMED. Read of the pointer is atomic, and flag cannot be set while vnode lock is held. Requested by: jhb MFC after: 1 month
185170	22-Nov-2008	kib	Busy ufs filesystem around block of code that does ".." lookup. Since mnt_lock is before lock of any vnode on the mp, it uses LK_NOWAIT. Since MNTK_UNMOUNT may be transient, pdp lock is dropped when vfs_busy() failed, and operation is retried after some time. This way, ffs_vget() is not called on the mp that may be in the process of being destroyed by unmount. Check for the VI_DOOMED flag on pdp after its lock is reacquired, to better detect some situations where directory containing ".." entry is removed during the lookup. Reviewed by: tegge, attilio (previous version) Tested by: pho MFC after: 1 month
185102	19-Nov-2008	jhb	Fix typo.
184934	13-Nov-2008	ambrisko	For now on every 10 cyclinder groups flush the buffer cache to free up space. If the buffer cache fills up then the disk systems can grind to a halt. Better tuning can be figured out later. Tested by: Tim, others and work Reviewed by: Kostik Belousov PR: 128832
184651	04-Nov-2008	jhb	Quiet a WITNESS warning with the dirhash sx locks by setting the DUPOK flag. Specifically, if two threads race to create a dirhash for a directory, then one might already have created a private dirhash structure (and locked it) when it realizes the directory now has a structure and tries to lock that one.
184629	04-Nov-2008	trasz	In UFS, when reading EA that contains ACL fails for some reason, include inode number and filesystem name, so the administrator can fix the problem. Approved by: rwatson (mentor)
184554	02-Nov-2008	attilio	Improve VFS locking: - Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters. - Due to the change above, remove the mnt_lock lockmgr because it is now useless. - Due to the change above, vfs_busy() is no more linked to a lockmgr. Change so its KPI by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the old version (which was unlinked from the lockmgr alredy) and MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx once the mnt interlock is held (ability still desired by most consumers). - The stub used into vfs_mount_destroy(), that allows to override the mnt_ref if running for more than 3 seconds, make it totally useless. Remove it as it was thought to work into older versions. If a problem of "refcount held never going away" should appear, we will need to fix properly instead than trust on such hackish solution. - Fix a bug where returning (with an error) from dounmount() was still leaving the MNTK_MWAIT flag on even if it the waiters were actually woken up. Just a place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter. - Remove the markercnt refcount as it is useless. This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly. Discussed with: kib Tested by: pho
184413	28-Oct-2008	trasz	Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)
184408	28-Oct-2008	kib	Provide an explanation for getinoquota() call in the ufs_access vop. MFC after: 3 days
184214	23-Oct-2008	des	Fix a number of style issues in the MALLOC / FREE commit. I've tried to be careful not to fix anything that was already broken; the NFSv4 code is particularly bad in this respect.
184205	23-Oct-2008	des	Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months
184074	20-Oct-2008	kib	Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock and ffs_lock. This cannot catch situations where holdcnt is incremented not by curthread, but I think it is useful. Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks
183822	13-Oct-2008	kib	Sync up summary information for cylinder groups while data is already in memory during snapshot creation. This improves the results of the background fsck. Submitted by: tegge MFC after: 1 week
183754	10-Oct-2008	attilio	Remove the struct thread unuseful argument from bufobj interface. In particular following functions KPI results modified: - bufobj_invalbuf() - bufsync() and BO_SYNC() "virtual method" of the buffer objects set. Main consumers of bufobj functions are affected by this change too and, in particular, functions which changed their KPI are: - vinvalbuf() - g_vfs_close() Due to the KPI breakage, __FreeBSD_version will be bumped in a later commit. As a side note, please consider just temporary the 'curthread' argument passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP Reviewed by: kib Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
183331	24-Sep-2008	jhb	Enable shared lookups on UFS. There are some remaining issues with forced unmounts, but those are in the VFS lookup code are not UFS specific. Tested by: pho, kris
183280	22-Sep-2008	jhb	Close a race between concurrent calls to ufsdirhash_recycle() and ufsdirhash_free() introduced in my last commit by removing the dirhash about to be free'd in ufsdirhash_free() from the global dirhash list before dropping the sx lock. Tested by: kris
183212	20-Sep-2008	kib	Initialize va_flags and va_filerev properly in VOP_GETATTR(). Don't initialize va_vaflags and va_spare because they are not part of the VOP_GETATTR() API. Also don't initialize birthtime to ctime or zero. Submitted by: Jaakko Heinonen <jh saunalahti fi> Reviewed by: bde Discussed on: freebsd-fs MFC after: 1 month
183093	16-Sep-2008	jhb	Retire the 'i_reclen' field from the in-memory i-node. Previously, during a DELETE lookup operation, lookup would cache the length of the directory entry to be deleted in 'i_reclen'. Later, the actual VOP to remove the directory entry (ufs_remove, ufs_rename, etc.) would call ufs_dirremove() which extended the length of the previous directory entry to "remove" the deleted entry. However, we always read the entire block containing the directory entry when doing the removal, so we always have the directory entry to be deleted in-memory when doing the update to the directory block. Also, we already have to figure out where the directory entry that is being removed is in the block so that we can pass the component name to the dirhash code to update the dirhash. So, instead of passing 'i_reclen' from ufs_lookup() to the ufs_dirremove() routine, just read the 'd_reclen' field directly out of the entry being removed when updating the length of the previous entry in the block. This avoids a cosmetic issue of writing to 'i_reclen' while holding a shared vnode lock. It also slightly reduces the amount of side-band data passed from ufs_lookup() to operations updating a directory via the directory's i-node. Reviewed by: jeff
183080	16-Sep-2008	jhb	Fix a race with shared lookups on UFS. If the the dirhash code reached the cap on memory usage, then shared LOOKUP operations could start free'ing dirhash structures. Without these fixes, concurrent free's on the same directory could result in one of the threads blocked on a lock in a dirhash structure free'd by the other thread. - Replace the lockmgr lock in the dirhash structure with an sx lock. - Use a reference count managed with ufsdirhash_hold()/drop() to determine when to free the dirhash structures. The directory i-node holds a reference while the dirhash is attached to an i-node. Code that wishes to lock the dirhash while holding a shared vnode lock must first acquire a private reference to the dirhash while holding the vnode interlock before acquiring the dirhash sx lock. After acquiring the sx lock, it drops the private reference after checking to see if the dirhash is still used by the directory i-node.
183079	16-Sep-2008	jhb	- Only set i_offset in the parent directory's i-node during a lookup for non-LOOKUP operations. - Relax a VOP assertion for a DELETE lookup. rename() uses WANTPARENT instead of LOCKPARENT when looking up the source pathname. ufs_rename() uses a relookup() to lock the parent directory when it decides to finally remove the source path. Thus, it is ok for a DELETE with WANTPARENT set instead of LOCKPARENT to use a shared vnode lock rather than an exclusive vnode lock. Reported by: kris (2) Reviewed by: jeff
183078	16-Sep-2008	jhb	vdropl() drops the vnode interlock. Thus, the code in the QUOTA case that upgrades the vnode lock if it is share locked was dropping the interlock before actually checking VI_DOOMED. Fix this by do the vdropl() after the check and relying on it to drop the vnode interlock. Reported by: pho Reviewed by: kib MFC after: 1 week
183074	16-Sep-2008	kib	Suspend the write operations on the UFS filesystem being unmounted or remounted from rw to ro. Proposed and reviewed by: tegge In collaboration with: pho MFC after: 1 month
183073	16-Sep-2008	kib	When attempt is made to suspend a filesystem that is already syspended, wait until the current suspension is lifted instead of silently returning success immediately. The consequences of calling vfs_write() resume when not owning the suspension are not well-defined at best. Add the vfs_susp_clean() mount method to be called from vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and stop calling it manually. Add the thread flag TDP_IGNSUSP that allows to bypass the suspension point in the vn_start_write. It is intended for use by VFS in the situations where the suspender want to do some i/o requiring calls to vn_start_write(), and this i/o cannot be done later. Reviewed by: tegge In collaboration with: pho MFC after: 1 month
183072	16-Sep-2008	kib	Add the ffs structures introspection functions for ddb. Show the b_dep value for the buffer in the show buffer command. Add a comand to dump the dirty/clean buffer list for vnode. Reviewed by: tegge Tested and used by: pho MFC after: 1 month
183070	16-Sep-2008	kib	When downgrading the read-write mount to read-only, do_unmount() sets MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive() and ufs_itimes_locked(), UFS verifies whether the fs is read-only by checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag for inode on the fs being remounted rw->ro. Introduce UFS_RDONLY() struct ufsmount' method that reports the value of the fs_ronly. The later is set to 1 only after the remount is finished. Reviewed by: tegge In collaboration with: pho MFC after: 1 month
183067	16-Sep-2008	kib	The struct inode *ip supplied to softdep_freefile is not neccessary the inode having number ino. In r170991, the ip was marked IN_MODIFIED, that is not quite correct. Mark only the right inode modified by checking inode number. Reviewed by: tegge In collaboration with: pho MFC after: 1 month
182721	03-Sep-2008	trasz	When calling extattr_check_cred, use V{READ,WRITE}, not I{READ,WRITE}. Approved by: rwatson (mentor)
182542	31-Aug-2008	attilio	Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions. Manpages are updated accordingly. Tested by: Diego Sardina <siarodx at gmail dot com>
182371	28-Aug-2008	attilio	Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread was always curthread and totally unuseful. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
182366	28-Aug-2008	kib	In ffs_valloc(), ffs_vget() may fail because insmntque() refused to insert new vnode into the mount vnode list. Then, for the SU-enabled mount, ffs_vfree could create freefile dependency. This dependency can hang around forever since inode is not marked as IN_MODIFIED and correspondingly inodeblock may be not marked as dirty. After ffs_vget() fails, retry with FFSV_FORCEINSMQ, mark the inode as modified, and vput() it immediately. Take care of the dup alloc. Tested by: pho Reviewed by: tegge MFC after: 1 month
182365	28-Aug-2008	kib	Softdep code may need to instantiate vnode when processing dependencies. In particular, it may need this while syncing filesystem being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set, that could sometimes disallow insertion of the vnode into the vnode mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag. Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for new vnode and use it consistently from the softdep code instead of ffs_vget(). Add the retry logic to the softdep_flushfiles() to flush the vnodes that could be instantiated while flushing softdep dependencies. Tested by: pho, kris Reviewed by: tegge MFC after: 1 month
182115	24-Aug-2008	kib	Put the relocked variable from the r182111 into the #ifdef QUOTA braces to prevent warning about unused var on the !QUOTA kernels. Reported by: ed MFC after: 1 week
182111	24-Aug-2008	kib	Revert the r167541: "Remove unneeded getinoquota() call in the ufs_access()." The call to getinoquota in ufs_access() serves the purpose of instantiating inode dquot from the vn_open(). Since quotas are accounted only for the inodes with already attached dquot, removal of the call prevented opened inodes from participation in the quota calculations. Since ufs_access() may be called with the vnode being only shared locked, upgrade (and then downgrade) vnode lock if calling getinoquota(). Reported by: simon at optinet com In collaboration with: pho MFC after: 1 week
181528	10-Aug-2008	kib	Revert r181345. Move the NULL pointer check to the vfs_deleteopt() function. Discussed with: rodrigc MFC after: 3 days
181345	06-Aug-2008	kib	User may do "mount -o snapshot ...", that causes new FFS mount to be performed with snapshot option, while the mp->mnt_opt is NULL. Protect against NULL pointer dereference. Noted by: Mateusz Guzik <mjguzik gmail com> MFC after: 3 days
181329	05-Aug-2008	des	ufsmount.h uses "struct\tfoo bar;", except where it doesn't. quota.h uses "struct foo\tbar;", except where it doesn't. Try to make them both agree with themselves (though not with eachother)
181327	05-Aug-2008	des	Whitespace, prototypes
181018	30-Jul-2008	jhb	Whitespace tweak.
180758	23-Jul-2008	kib	The ffs_balloc_ufs{1,2} functions call bdwrite() while having several vnode buffers locked at once. In particular, there are indirect buffers among locked ones. The bdwrite() may start the flushing to keep dirty buffer list at the bounds. If any buffer on the dirty list requires translation from logical to physical block number, code may ends up trying to lock an indirect buffer already locked in ffs_balloc_ufsX. Prevent the bdflush() activity when several buffers are locked at once by setting the TDP_INBDFUSH for the problematic code blocks. Reported and tested by: pho, Josef Buchsteiner at Juniper In collaboration with: kan MFC after: 1 month
180621	19-Jul-2008	pjd	Say hi to svn, by simplifing ffs_vget() function a bit - there is no need for a variable that is used only once.
179295	24-May-2008	rodrigc	Fix comments to replace SBSIZE with SBLOCKSIZE, since SBSIZE was renamed to SBLOCKSIZE in version 1.33 Reviewed by: mckusick
179270	24-May-2008	rodrigc	After converting the "snapshot" mount option to the MNT_SNAPSHOT flag, delete "snapshot" from the persistent mount options list. This should fix problems with doing a mount -o snapshot of a file system, followed by an NFS export of the same file system. PR: 122833 Reported by: Leon Kos <leon.kos lecad fs uni-lj si>, Jaakko Heinonen <jh saunalahti fi> MFC after: 1 month
179269	24-May-2008	rodrigc	For the following mount options, do not perform the string to flag conversions here, because we already do them further up in vfs_donmount() in vfs_mount.c async -> MNT_ASYNC force -> MNT_FORCE multilabel -> MNT_MULTILABEL noatime -> MNT_NOATIME noclusterr -> MNT_NOCLUSTERR noclusterw -> MNT_NOCLUSTERW MFC after: 1 month
179159	20-May-2008	ups	Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set) Directory IO without a VM object will store data in 'malloced' buffers severely limiting caching of the data. Without this change VM objects for directories are only created on an open() of the directory. TODO: Inline test if VM object already exists to avoid locking/function call overhead. Tested by: kris@ Reviewed by: jeff@ Reported by: David Filo
178420	22-Apr-2008	jeff	- Use a local variable for i_ino in ufs_lookup. It is only used to communicate between two parts of this one function. This was causing problems with shared lookups as each would trash the ino value in the inode. - Remove the unused i_ino field from the inode structure.
178243	16-Apr-2008	kib	Move the head of byte-level advisory lock list from the filesystem-specific vnode data to the struct vnode. Provide the default implementation for the vop_advlock and vop_advlockasync. Purge the locks on the vnode reclaim by using the lf_purgelocks(). The default implementation is augmented for the nfs and smbfs. In the nfs_advlock, push the Giant inside the nfs_dolock. Before the change, the vop_advlock and vop_advlockasync have taken the unlocked vnode and dereferenced the fs-private inode data, racing with with the vnode reclamation due to forced unmount. Now, the vop_getattr under the shared vnode lock is used to obtain the inode size, and later, in the lf_advlockasync, after locking the vnode interlock, the VI_DOOMED flag is checked to prevent an operation on the doomed vnode. The implementation of the lf_purgelocks() is submitted by dfr. Reported by: kris Tested by: kris, pho Discussed with: jeff, dfr MFC after: 2 weeks
178110	11-Apr-2008	jeff	- Use a lockmgr lock rather than a mtx to protect dirhash. This lock may be held for the duration of the various dirhash operations which avoids many complex unlock/lock/revalidate sequences. - Permit shared locks on lookup. To protect the ip->i_dirhash pointer we use the vnode interlock in the shared case. Callers holding the exclusive vnode lock can run without fear of concurrent modification to i_dirhash. - Hold an exclusive dirhash lock when creating the dirhash structure for the first time or when re-creating a dirhash structure which has been recycled. Tested by: kris, pho
178109	11-Apr-2008	jeff	- cache dp->i_offset in the local 'i_offset' variable for use in loop indexes so directory lookup becomes shared lock safe. In the modifying cases an exclusive lock is held here so the commit routine may rely on the state of i_offset. - Similarly handle i_diroff by fetching at the start and setting only once the operation is complete. Without the exclusive lock these are only considered hints. - Assert that an exclusive lock is held when we're preparing for a commit routine. - Honor the lock type request from lookup instead of always using exclusive locking. Tested by: pho, kris
177983	07-Apr-2008	pjd	Correct function name in panic(). Reported by: kensmith
177957	06-Apr-2008	attilio	Optimize lockmgr in order to get rid of the pool mutex interlock, of the state transitioning flags and of msleep(9) callings. Use, instead, an algorithm very similar to what sx(9) and rwlock(9) alredy do and direct accesses to the sleepqueue(9) primitive. In order to avoid writer starvation a mechanism very similar to what rwlock(9) uses now is implemented, with the correspective per-thread shared lockmgrs counter. This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and lockmgr_args_rw(). These two are like the 2 "normal" versions, but they both accept a rwlock as interlock. In order to realize this, the general lockmgr manager function "__lockmgr_args()" has been implemented through the generic lock layer. It supports all the blocking primitives, but currently only these 2 mappers live. The patch drops the support for WITNESS atm, but it will be probabilly added soon. Also, there is a little race in the draining code which is also present in the current CVS stock implementation: if some sharers, once they wakeup, are in the runqueue they can contend the lock with the exclusive drainer. This is hard to be fixed but the now committed code mitigate this issue a lot better than the (past) CVS version. In addition assertive KA_HELD and KA_UNHELD have been made mute assertions because they are dangerous and they will be nomore supported soon. In order to avoid namespace pollution, stack.h is splitted into two parts: one which includes only the "struct stack" definition (_stack.h) and one defining the KPI. In this way, newly added _lockmgr.h can just include _stack.h. Kernel ABI results heavilly changed by this commit (the now committed version of "struct lock" is a lot smaller than the previous one) and KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction, so manpages and __FreeBSD_version will be updated accordingly. Tested by: kris, pho, jeff, danger Reviewed by: jeff Sponsored by: Google, Summer of Code program 2007
177785	31-Mar-2008	kib	Add the support for the AT_FDCWD and fd-relative name lookups to the namei(9). Based on the submission by rdivacky, sponsored by Google Summer of Code 2007 Reviewed by: rwatson, rdivacky Tested by: pho
177779	31-Mar-2008	jeff	- Since rev 1.142 of ffs_snapshot.c the interlock has not been required to protect the v_lock pointer. Removing the interlock acquisition here allows vn_lock() to proceed without requiring the interlock at all. - If the lock mutated while we were sleeping on it the interlock has been dropped. It is conceivable that the upper layer code was relying on the interlock and LK_NOWAIT to protect the identity or state of the vnode while acquiring the lock. In this case return EBUSY rather than trying the new lock to prevent potential races. Reviewed by: tegge
177778	31-Mar-2008	jeff	- Don't free snapdata structures when they are no longer in use. Keeping the lockmgr lock valid allows us to switch the v_lock pointer in snapshot vnodes between the embedded lockmgr lock and snapdata lock without needing the vnode interlock to protect against races - Keep unused snapdata structures in a list. - Add a function to lock the devvp and allocate a snapdata to it or acquire a new one without races. The old function was safe from creation races because we set the mount flag when creating snapshots and thus serializing them. However, it might have been subject to destroying races. Reviewed by: tegge
177645	26-Mar-2008	jhb	Fix a nit with the 'nofoo' options where 'foo' is mapped to 'nonofoo' (such as 'atime' vs 'noatime'). The filesystems will always see either 'nofoo' or 'nonofoo', never plain 'foo'. As such, their list of valid mount options should include 'nofoo' instead of 'foo'. With this fix, you can do 'mount -u -o atime' on a FFS filesystem that isn't marked as noatime without getting an error. You can also update a noatime FFS filesystem mounted via mount(2) (e.g. 6.x /sbin/mount binary) to 'atime' using nmount(2) (e.g. 7.x /sbin/mount binary). MFC after: 1 week Reviewed by: crodig
177633	26-Mar-2008	dfr	Add the new kernel-mode NFS Lock Manager. To use it instead of the user-mode lock manager, build a kernel with the NFSLOCKD option and add '-k' to 'rpc_lockd_flags' in rc.conf. Highlights include: * Thread-safe kernel RPC client - many threads can use the same RPC client handle safely with replies being de-multiplexed at the socket upcall (typically driven directly by the NIC interrupt) and handed off to whichever thread matches the reply. For UDP sockets, many RPC clients can share the same socket. This allows the use of a single privileged UDP port number to talk to an arbitrary number of remote hosts. * Single-threaded kernel RPC server. Adding support for multi-threaded server would be relatively straightforward and would follow approximately the Solaris KPI. A single thread should be sufficient for the NLM since it should rarely block in normal operation. * Kernel mode NLM server supporting cancel requests and granted callbacks. I've tested the NLM server reasonably extensively - it passes both my own tests and the NFS Connectathon locking tests running on Solaris, Mac OS X and Ubuntu Linux. * Userland NLM client supported. While the NLM server doesn't have support for the local NFS client's locking needs, it does have to field async replies and granted callbacks from remote NLMs that the local client has contacted. We relay these replies to the userland rpc.lockd over a local domain RPC socket. * Robust deadlock detection for the local lock manager. In particular it will detect deadlocks caused by a lock request that covers more than one blocking request. As required by the NLM protocol, all deadlock detection happens synchronously - a user is guaranteed that if a lock request isn't rejected immediately, the lock will eventually be granted. The old system allowed for a 'deferred deadlock' condition where a blocked lock request could wake up and find that some other deadlock-causing lock owner had beaten them to the lock. * Since both local and remote locks are managed by the same kernel locking code, local and remote processes can safely use file locks for mutual exclusion. Local processes have no fairness advantage compared to remote processes when contending to lock a region that has just been unlocked - the local lock manager enforces a strict first-come first-served model for both local and remote lockers. Sponsored by: Isilon Systems PR: 95247 107555 115524 116679 MFC after: 2 weeks
177528	23-Mar-2008	kib	Yield the cpu in the kernel while iterating the list of the vnodes belonging to the mountpoint. Also, yield when in the softdep_process_worklist() even when we are not going to sleep due to buffer drain. It is believed that the ULE fixed the problem [1], but the yielding seems to be needed at least for the 4BSD case. Discussed: on stable@, with bde Reviewed by: tegge, jeff [1] MFC after: 2 weeks
177493	22-Mar-2008	jeff	- Complete part of the unfinished bufobj work by consistently using BO_LOCK/UNLOCK/MTX when manipulating the bufobj. - Create a new lock in the bufobj to lock bufobj fields independently. This leaves the vnode interlock as an 'identity' lock while the bufobj is an io lock. The bufobj lock is ordered before the vnode interlock and also before the mnt ilock. - Exploit this new lock order to simplify softdep_check_suspend(). - A few sync related functions are marked with a new XXX to note that we may not properly interlock against a non-zero bv_cnt when attempting to sync all vnodes on a mountlist. I do not believe this race is important. If I'm wrong this will make these locations easier to find. Reviewed by: kib (earlier diff) Tested by: kris, pho (earlier diff)
177474	21-Mar-2008	kib	Reduce the acquisition of the vnode interlock in the ffs_read() and ffs_extread() when setting the IN_ACCESS flag by checking whether the IN_ACCESS is already set. The possible race there is admissible. Tested by: pho Submitted by: jeff
177368	19-Mar-2008	jeff	- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from requiring the per-process spinlock to only requiring the process lock. - Reflect these changes in the proc.h documentation and consumers throughout the kernel. This is a substantial reduction in locking cost for these fields and was made possible by recent changes to threading support.
177253	16-Mar-2008	rwatson	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink
177156	13-Mar-2008	cokane	Replace the non-MPSAFE timeout(9) API in ffs_softdep.c with the MPSAFE callout_* API (e.g. callout_init_mtx(9)). This was one of the numerous items on the http://wiki.freebsd.org/SMPTODO list. Reviewed by: imp, obrien, jhb MFC after: 1 week
177034	10-Mar-2008	emaste	Remove include of opt_quota.h; as of revision 1.205 there is no longer any #ifdef QUOTA conditional code.
176831	05-Mar-2008	kib	Initialize mnt_stat.f_iosize before autostarting UFS1 extattrs. It is normally initialized by ffs_statfs() after ffs_mount finished. The extattr autostart code calls the ufs_lookup(), that uses value above to iterate over the directory blocks, see bmask initialization in the ufs_lookup() and ufsdirhash. Having the filesystem with root directory spanning more then one block would result in reading a random kernel memory. PR: kern/120781 Test case provided by: rwatson MFC after: 1 week
176797	04-Mar-2008	rwatson	Continue on-going campaign to replace lockmgr locks with sx locks where the specific semantics of ockmgr aren't required: update UFS1 extended attributes to protect its data structures using an sx lock. While here, update comments on lock granularity. MFC after: 2 weeks
176795	04-Mar-2008	rwatson	Move setting of MNTK_MPSAFE flag before UFS1 extended attribute auto-start so that the flag is set before we start performing I/O in the auto-start routine. MFC after: 2 weeks Suggested by: kib
176752	02-Mar-2008	rwatson	Don't auto-start or allow extattrctl for UFS2 file systems, as UFS2 has native extended attributes. This didn't interfere with the operation of UFS2 extended attributes, but the code shouldn't be running for UFS2. MFC after: 2 weeks
176564	25-Feb-2008	keramida	Minor typo nit.
176559	25-Feb-2008	attilio	Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is always curthread. As KPI gets broken by this patch, manpages and __FreeBSD_version will be updated by further commits. Tested by: Andrea Barberio <insomniac at slackware dot it>
176519	24-Feb-2008	attilio	Introduce some functions in the vnode locks namespace and in the ffs namespace in order to handle lockmgr fields in a controlled way instead than spreading all around bogus stubs: - VN_LOCK_AREC() allows lock recursion for a specified vnode - VN_LOCK_ASHARE() allows lock sharing for a specified vnode In FFS land: - BUF_AREC() allows lock recursion for a specified buffer lock - BUF_NOREC() disallows recursion for a specified buffer lock Side note: union_subr.c::unionfs_node_update() is the only other function directly handling lockmgr fields. As this is not simple to fix, it has been left behind as "sole" exception.
176320	15-Feb-2008	attilio	- Introduce lockmgr_args() in the lockmgr space. This function performs the same operation of lockmgr() but accepting a custom wmesg, prio and timo for the particular lock instance, overriding default values lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo. - Use lockmgr_args() in order to implement BUF_TIMELOCK() - Cleanup BUF_LOCK() - Remove LK_INTERNAL as it is nomore used in the lockmgr namespace Tested by: Andrea Barberio <insomniac at slackware dot it>
175635	24-Jan-2008	attilio	Cleanup lockmgr interface and exported KPI: - Remove the "thread" argument from the lockmgr() function as it is always curthread now - Axe lockcount() function as it is no longer used - Axe LOCKMGR_ASSERT() as it is bogus really and no currently used. Hopefully this will be soonly replaced by something suitable for it. - Remove the prototype for dumplockinfo() as the function is no longer present Addictionally: - Introduce a KASSERT() in lockstatus() in order to let it accept only curthread or NULL as they should only be passed - Do a little bit of style(9) cleanup on lockmgr.h KPI results heavilly broken by this change, so manpages and FreeBSD_version will be modified accordingly by further commits. Tested by: matteo
175486	19-Jan-2008	attilio	- Introduce the function lockmgr_recursed() which returns true if the lockmgr lkp, when held in exclusive mode, is recursed - Introduce the function BUF_RECURSED() which does the same for bufobj locks based on the top of lockmgr_recursed() - Introduce the function BUF_ISLOCKED() which works like the counterpart VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus BUF_REFCNT() in a more explicative and SMP-compliant way. This allows us to axe out BUF_REFCNT() and leaving the function lockcount() totally unused in our stock kernel. Further commits will axe lockcount() as well as part of lockmgr() cleanup. KPI results, obviously, broken so further commits will update manpages and freebsd version. Tested by: kris (on UFS and NFS)
175294	13-Jan-2008	attilio	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjuction with 'thread' argument passing which is always curthread. Remove the unuseful extra-argument and pass explicitly curthread to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
175202	10-Jan-2008	attilio	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
175068	03-Jan-2008	kib	ffs_balloc_ufsX() routines, in the case of recovering from the failed allocation, free the indirect blocks before clearing the disk pointers, that could lead to the softupdate inconsistencies in the case of the machine or disk crash at the wrong time. Rearrange the recover code to do the ffs_blkfree() after the second ffs_syncvnode(), that clears the pointers chain. Proposed and reviewed by: tegge Tested by: Peter Holm MFC after: 3 weeks
175053	02-Jan-2008	obrien	style(9)
174973	29-Dec-2007	kib	The ffs_balloc() routines, whan allocating the indirect blocks for the inode, do the rollback in case the allocation failed (due to insufficient free space or quota limits). But, the code does leaves the buffers corresponding to the inoirect blocks on the vnode bufobj list. This causes several assertion failures (for instance, "ffs_truncate3" in ffs_truncate()) to fail, and could result in the indirect block aliasing problem, like writing the context of such blocks to random disk location. Remove the buffers from the bufobj properly. Reported and tested by: Peter Holm Reviewed by: tegge MFC after: 3 weeks
174126	01-Dec-2007	kensmith	Fix a broken check that recently became more annoying because it now gets enabled when INVARIANTS is on instead of DIAGNOSTIC (which apparently nobody uses). From Tor's description: This happens when the block range spans two block maps, the first in the inode (mapping up to NDADDR direct blocks) and the second being the first indirect block. The current check assumes that both block maps are indirect blocks. Work done by: tegge Tested by: kris, kensmith
173501	09-Nov-2007	ru	Fix build without INVARIANTS and update a comment to match a change made in previous revision.
173464	08-Nov-2007	obrien	Turn most ffs 'DIAGNOSTIC's into INVARIANTS.
172930	24-Oct-2007	rwatson	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
172836	20-Oct-2007	julian	Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx as they actually make whole processes. Thos makes way for us to add REAL kthread_create() and friends that actually make theads. it turns out that most of these calls actually end up being moved back to the thread version when it's added. but we need to make this cosmetic change first. I'd LOVE to do this rename in 7.0 so that we can eventually MFC the new kthread_xxx() calls.
172697	16-Oct-2007	alfred	Get rid of qaddr_t. Requested by: bde
172113	10-Sep-2007	bz	Fix a DIV0 in case a large value for fs_avgfilesize or fs_avgfpdir is given (with newfs or tunefs) and dirsize overflows. In case dirsize is <= 0 because of an overflow set maxcontigdirs to 0 so it will be 1 later. This is what would happen for large fs_avgfilesize. [1] Identified with help from: roberto, pjd Submitted by: pjd [1] Approved by: re (rwatson) MFC after: 8 days
171437	13-Jul-2007	rodrigc	Perform range check before allocating memory when reading extended attributes. Reviewed by: kib Approved by: re (hrs) PR: 114389
171147	02-Jul-2007	peter	Fix an annoying pointer/int cast warning that shows up on 64 bit systems. Approved by: re
170991	22-Jun-2007	kib	Fix livelock that could occur when snapshoting UFS with quotas, where some quota limit was exceeded. Sequence of UFS_VALLOC()/UFS_VFREE() call there could cause inodeblock to have both freefile and inodedep dependencies without any inode in the block being marked for write. Then, softdep_check_suspend() would return EAGAIN forewer. Force write of inodeblock with allocated freefile softdependency by setting IN_MODIFIED flag in softdep_freefile and unconditionally calling UFS_UPDATE() in ufs_reclaim. Reported by: kris Debug help and tested by: Peter Holm Approved by: re (kensmith) MFC after: 3 weeks
170587	12-Jun-2007	rwatson	Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in some cases, move to priv_check() if it was an operation on a thread and no other flags were present. Eliminate caller-side jail exception checking (also now-unused); jail privilege exception code now goes solely in kern_jail.c. We can't yet eliminate suser() due to some cases in the KAME code where a privilege check is performed and then used in many different deferred paths. Do, however, move those prototypes to priv.h. Reviewed by: csjp Obtained from: TrustedBSD Project
170307	05-Jun-2007	jeff	Commit 14/14 of sched_lock decomposition. - Use thread_lock() rather than sched_lock for per-thread scheduling sychronization. - Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
170183	01-Jun-2007	kib	Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation argument from being file descriptor index into the pointer to struct file: part 2. Convert calls missed in the first big commit. Noted by: rwatson Pointy hat to: kib
170174	01-Jun-2007	jeff	- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. reads proceed without locks. - Aggregate exiting threads rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch() to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to it's exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits. Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)
170152	31-May-2007	kib	Revert UF_OPENING workaround for CURRENT. Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation argument from being file descriptor index into the pointer to struct file. Proposed and reviewed by: jhb Reviewed by: daichi (unionfs) Approved by: re (kensmith)
170041	28-May-2007	pjd	- Remove unnecessary vnode internal locking - v_vflag is protect by vnode's lock (not vnode's interlock). - Simplify code a bit.
169898	23-May-2007	pjd	Eliminate VI_LOCK()/VI_UNLOCK() pair from getattr and close code paths. It's hard to measure performance improvement on my test machine, but the change won't degrade performance for sure. I can measure slight improvement for debugging kernel and it can also be a win for machines where atomic operation is more expensive. Reviewed by: kib
169671	18-May-2007	kib	Since renaming of vop_lock to _vop_lock, pre- and post-condition function calls are no more generated for vop_lock. Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption about vop naming conventions. This restores pre/post-condition calls.
169239	03-May-2007	thompsa	Add a newline to the printf message.
168576	10-Apr-2007	kib	Fix the NAMEI zone leak when snapshot was successfully created. Reported and tested by: Peter Holm MFC after: 2 weeks
168575	10-Apr-2007	kib	Recalculate the NEWBLOCK flag for pagedep structure after the softdep lock is dropped, since pagedep may be already processed and deallocated. Found and tested by: kris MFC after: 2 weeks
168574	10-Apr-2007	kib	When LK_NOWAIT is passed as argument to process_worklist_item(), this does not prevent handle_workitem_remove() from recursing into a blocking version. Add the dirrem to worklist instead of processing it now if this is the case. Reported and tested by: kris Submitted by: tegge MFC after: 2 weeks
168353	04-Apr-2007	delphij	Use *_EMPTY macros when appropriate.
168021	29-Mar-2007	kib	Revert rev. 1.205. Replace unconditional acquision of Giant when QUOTAS are defined with VFS_LOCK_GIANT(NULL) call. This shall fix softdep operation when mpsafe_vfs = 0. Reported and tested by: kris Submitted by: tegge MFC after: 1 week
167737	20-Mar-2007	kib	Mark UFS as being MP-Safe in "options QUOTA" case too. Remove no more neccessary Giant acquisions in softdepend processing code. Tested by: Peter Holm Reviewed by: tegge Approved by: re (kensmith)
167719	19-Mar-2007	brian	When we write extended attributes, assert that the inode hasn't already been deleted. The assertion is important to show that we won't end up accounting for extended attribute blocks (using fs_pendingblocks) in our subsequent call to fs_alloc(). Agreed verbally by: mckusick MFC after: 3 weeks
167543	14-Mar-2007	kib	Implement fine-grained locking for UFS quotas. Each struct dquot gets dq_lock mutex to protect dq_flags and to interlock with DQ_LOCK. qhash, dqfreelist and dq.dq_cnt are protected by global dqhlock mutex. i_dquot array for inode is protected by lockmgr' vnode lock, corresponding assert added to the dqget(). Access to struct ufsmount quota-related fields (um_quotas and um_qflags) is protected by um_lock. Tested by: Peter Holm Reviewed by: tegge Approved by: re (kensmith) This work were not possible without enormous amount of help given by Tor Egge and Peter Holm. Tor reviewed each version of patch, pointed out numerous errors and provided invaluable suggestions. Peter did tireless testing of the patch as it was developed.
167542	14-Mar-2007	kib	Call getinoquota() before allocating new block for the directory to properly account for block allocation. Tested by: Peter Holm Reviewed by: tegge Approved by: re (kensmith)
167541	14-Mar-2007	kib	Remove unneeded getinoquota() call in the ufs_access(). Tested by: Peter Holm Reviewed by: tegge Approved by: re (kensmith)
167497	13-Mar-2007	tegge	Make insmntque() externally visibile and allow it to fail (e.g. during late stages of unmount). On failure, the vnode is recycled. Add insmntque1(), to allow for file system specific cleanup when recycling vnode on failure. Change getnewvnode() to no longer call insmntque(). Previously, embryonic vnodes were put onto the list of vnode belonging to a file system, which is unsafe for a file system marked MPSAFE. Change vfs_hash_insert() to no longer lock the vnode. The caller now has that responsibility. Change most file systems to lock the vnode and call insmntque() or insmntque1() after a new vnode has been sufficiently setup. Handle failed insmntque*() calls by propagating errors to callers, possibly after some file system specific cleanup. Approved by: re (kensmith) Reviewed by: kib In collaboration with: kib
167259	06-Mar-2007	mckusick	Move macros describing extended attributes in UFS from <sys/extattr.h> to <ufs/ufs/extattr.h>. Move description of extended attributes in UFS from man9/extattr.9 to man5/fs.5. Note that restore will not compile until <sys/extattr.h> and <ufs/ufs/extattr.h> have been updated. Suggested by: Robert Watson
167155	01-Mar-2007	pjd	Fix build breakage.
167154	01-Mar-2007	pjd	Change: "... try to use VADMIN in preference to VADMIN ..." To: "... try to use VADMIN in preference to VWRITE ..."
167152	01-Mar-2007	pjd	Rename PRIV_VFS_CLEARSUGID to PRIV_VFS_RETAINSUGID, which seems to better describe the privilege. OK'ed by: rwatson
167151	01-Mar-2007	pjd	Avoid checking for privileges if there is no need to. Discussed with: rwatson
166924	23-Feb-2007	brian	Account for di_blocks allocations when IN_SPACECOUNTED is set in an inode's i_flag. It's possible that after ufs_infactive() calls softdep_releasefile(), i_nlink stays >0 for a considerable amount of time (> 60 seconds here). During this period, any ffs allocation routines that alter di_blocks must also account for the blocks in the filesystem's fs_pendingblocks value. This change fixes an eventual df/du discrepency that will happen as the result of fs_pendingblocks being reduced to <0. The only manifestation of this that people may recognise is the following message on boot: /somefs: update error: blocks -N files M at which point the negative pending block count is adjusted to zero. Reviewed by: tegge MFC after: 3 weeks
166864	21-Feb-2007	mckusick	The functions that set and delete external attributes must check that the filesystem is not mounted read-only before proceeding. Reported by: Ryan Beasley <ryanb@FreeBSD.org> MFC after: 1 week
166832	19-Feb-2007	rwatson	Rename three quota privileges from the UFS privilege namespace to the VFS privilege namespace: exceedquota, getquota, and setquota. Leave UFS-specific quota configuration privileges in the UFS name space. This renumbers VFS and UFS privileges, so requires rebuilding modules if you are using security policies aware of privilege identifiers. This is likely no one at this point since none of the committed MAC policies use the privilege checks.
166831	19-Feb-2007	rwatson	Limit quota privileges in jail to PRIV_UFS_GETQUOTA and PRIV_UFS_SETQUOTA.
166799	17-Feb-2007	mckusick	This README file is obsolete. The cited problems were fixed long ago and the code is installed by default so no longer requires action by the administrator to be included.
166774	15-Feb-2007	pjd	Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method. This way we may support multiple structures in v_data vnode field within one file system without using black magic. Vnode-to-file-handle should be VOP in the first place, but was made VFS operation to keep interface as compatible as possible with SUN's VFS. BTW. Now Solaris also implements vnode-to-file-handle as VOP operation. VFS_VPTOFH() was left for API backward compatibility, but is marked for removal before 8.0-RELEASE. Approved by: mckusick Discussed with: many (on IRC) Tested with: ufs, msdosfs, cd9660, nullfs and zfs
166743	15-Feb-2007	kib	Style(9).
166564	08-Feb-2007	kib	Remove not needed acquision of the mount interlock aroung reading of mnt_kern_flags in ufs_itimes(). Suggested by: ssouhlal Confirmed by: tegge MFC after: 2 weeks
166506	04-Feb-2007	tegge	Call pbgetvp() and pbrelvp() instead of setting b_vp directly. PR: kern/108151
166487	04-Feb-2007	mpp	If quotacheck or edquota reset the block or inode grace time for a user or group, when the kernel first sees this, it will update the grace time value. However, it never flags the quota as modified and the updated value never makes it to the quota data file unless the user actually makes some other change that would write the data out. Fixed to flag the quota as modified if the soft limit has actually been reached and should be now enforced.
166381	01-Feb-2007	mpp	Prevent quotactl calls that pass in an id of -1 from incorrectly using the callers UID instead of the GID when performing group operations. This could allow users to determine group quota information for groups they are not a member of in some cases. Rename the "uid" parameter in ufs_quotactl to "id" to better show that it is used for more than just the uid, and to be more in line with the naming conventions in the other quota routines. PR: kern/33940
166380	01-Feb-2007	mpp	Disallow negative UIDs when processing quotactl options.
166193	23-Jan-2007	kib	Cylinder group bitmaps and blocks containing inode for a snapshot file are after snaplock, while other ffs device buffers are before snaplock in global lock order. By itself, this could cause deadlock when bdwrite() tries to flush dirty buffers on snapshotted ffs. If, during the flush, COW activity for snapshot needs to allocate block and ffs_alloccg() selects the cylinder group that is being written by bdwrite(), then kernel would panic due to recursive buffer lock acquision. Avoid dealing with buffers in bdwrite() that are from other side of snaplock divisor in the lock order then the buffer being written. Add new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in the bdwrite(). Default implementation, bufbdflush(), refactors the code from bdwrite(). For ffs device buffers, specialized implementation is used. Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes) Tested by: Peter Holm X-MFC after: 3 weeks (if ever: it changes ABI)
166146	20-Jan-2007	delphij	Fix build. chkdquot() should not return anything.
166142	20-Jan-2007	mpp	Quota system cleanup. 1) Do not do quota accounting for the actual quota data files or for file system snapshot files ("system" files). This prevents a deadlock descibed in PR kern/30958 if the kernel ever has to grow the quota file. Snapshot files were already exempt from the quota checks, but this change generalized the check. 2) Fix a cast that caused extremely large uids/gids to incorrectly write the quota information to the data file at a truncated value for a uint_t32 id value. The incorrect cast caused quota files in this case to be around 4GB in size, with the correct cast they can now be 131GB in size. Also related to PR kern/30958. 3) Check for what appear to be negative UIDs/GIDs and not account for them. This prevents the quota files from becoming 131GB in size and causing quotacheck to run forever at bootup. This could also cause the kernel to try and expand the quota file, which might deadlock due to the issue in #1. kern/30958 and kern/38156 (and some much older closed PR's). 4) With the deadlock problems gone, the kernel can now expand the size of the quota database files if it needs to. 5) Pass in the i-node count change value to chkiq and chkiqchg as an int, like it used to be before the common routine was split up into 2 different routines to increase / decrease the i-node in-use count. Prevents an underflow on the i-node count. Related to PR kern/89247. 6) Prevent the block usage from growing slowly if a file system is full and the write was denied due to that fact. PR kern/89247. Some of these changes require an updated quotacheck to prevent the creation of huge (131GB) quota data files (item #3). #1/#4 probably fixes a lot of the random hangs when quotas are enabled, possibly some of the jail hangs.
166052	16-Jan-2007	mpp	Fix a spelling error. heirarchy -> hierarchy. Obtained from: OpenBSD
166051	16-Jan-2007	mpp	Fix a spelling error in some comments. heirarchy -> hierarchy. Obtained from: OpenBSD
165890	08-Jan-2007	rwatson	Canonicalize copyright: use a date range rather than comma-delimited list. MFC after: 3 days
164248	13-Nov-2006	kmacy	change vop_lock handling to allowing tracking of callers' file and line for acquisition of lockmgr locks Approved by: scottl (standing in for mentor rwatson)
164033	06-Nov-2006	rwatson	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
163874	01-Nov-2006	kib	Aquire Giant in the softdep_flush for clear_remove() and clear_inodedeps() processing when QUOTA is set. Reported and tested by: Peter Holm Reviewed by: tegge MFC after: 3 days
163841	31-Oct-2006	pjd	Add gjournal specific code to the UFS file system: - Add FS_GJOURNAL flag which enables gjournal support on a file system. - Add cg_unrefs field to the cylinder group structure which holds number of unreferenced (orphaned) inodes in the given cylinder group. - Add fs_unrefs field to the super block structure which holds total number of unreferenced (orphaned) inodes. - When file or a directory is orphaned (last reference is removed, but object is still open), increase fs_unrefs and cg_unrefs fields, which is a hint for fsck in which cylinder groups looks for such (orphaned) objects. - When file is last closed, decrease {fs,cg}_unrefs fields. - Add VV_DELETED vnode flag which points at orphaned objects. Sponsored by: home.pl
163606	22-Oct-2006	rwatson	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
163194	10-Oct-2006	kib	Do not translate the IN_ACCESS inode flag into the IN_MODIFIED while filesystem is suspending/suspended. Doing so may result in deadlock. Instead, set the (new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted. Change the locking protocol in order to set the IN_ACCESS and timestamps without upgrading shared vnode lock to exclusive (see comments in the inode.h). Before that, inode was modified while holding only shared lock. Tested by: Peter Holm Reviewed by: tegge, bde Approved by: pjd (mentor) MFC after: 3 weeks
162942	02-Oct-2006	tegge	Correct check for when IO_SYNC should be set for filesystem not using softupdates when truncating a directory to zero length. Discussed with: bde
162654	26-Sep-2006	tegge	Protect change to bo_flag by holding the bufobj mutex.
162653	26-Sep-2006	tegge	Reduce fluctuations of mnt_flag to allow unlocked readers to get a slightly more consistent view.
162652	26-Sep-2006	tegge	Don't restore MNT_QUOTA bit in mnt_flag after snapshot creation, closing a race between nmount() and quotactl().
162650	26-Sep-2006	tegge	Increase mnt_noasync once in softdep_mount() to disallow async io, closing a window where a file system using softupdates could be async for a short while if both MNT_UPDATE and MNT_ASYNC were passed as flags to nmount(). Add MNTK_SOFTDEP flag to ensure that softdep_mount() doesn't increase mnt_noasync multiple times.
162649	26-Sep-2006	tegge	Add mnt_noasync counter to better handle interleaved calls to nmount(), sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag which is set only when MNT_ASYNC is set and mnt_noasync is zero, and check that flag instead of MNT_ASYNC before initiating async io.
162647	26-Sep-2006	tegge	Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag. This eliminates a race where MNT_UPDATE flag could be lost when nmount() raced against sync(), sync_fsync() or quotactl().
162460	20-Sep-2006	kib	Fix the glitch introduced in rev. 1.93. In softdep_sync_metadata(), switch by worklist type contains two for() loops, for D_INDIRDEP and D_PAGEDEP. On error, these loops are exited by break, where the switch actually shall be leaved. Use goto instead of break to reach the error handling code. Reported by: Peter Holm Reviewed by: tegge Approved by: pjd (mentor) MFC after: 2 weeks
162383	17-Sep-2006	rwatson	Declare security and security.bsd sysctl hierarchies in sysctl.h along with other commonly used sysctl name spaces, rather than declaring them all over the place. MFC after: 1 month Sponsored by: nCircle Network Security, Inc.
161515	21-Aug-2006	kib	While checking for update of snapshot file in the ffs_copyonwrite, first filter out metadata update. Otherwise, devfs vnode could be erronously interpreted as ufs one, causing further check of i_flags to use random memory. PR: kern/100365 Debugged and fix described by: tegge Approved by: pjd (mentor) MFC after: 2 weeks
161473	20-Aug-2006	pjd	Correct typo in comment.
160859	31-Jul-2006	obrien	Rather than print out a nice error message giving details sufficent to fix a 'ufs_dirbad' and then panicing (making it very hard to see the details), put them in the panic message itself.
160462	18-Jul-2006	stefanf	Drop two unnecessary casts.
160269	11-Jul-2006	daichi	The ufs_lookup.c has a critical bug around the whiteout process. UFS must check a whiteout name when it uses the whiteout, but the current implementation does not check the whileout name, so sometimes UFS writes over a wrong whtieout. UFS MUST check the whiteout name to use a corrent whiteout. This bug leads unionfs. panic. This commit fixes this trouble. Submitted by: Masanori Ozawa <ozawa@ongs.co.jp> (unionfs developer) Reviewed by: tegge & rodrigc (mentor) Approved by: rodrigc (mentor) MFC after: 2 weeks
160205	09-Jul-2006	pjd	Declare UFS module version.
160204	09-Jul-2006	pjd	Change fs->fs_fsmnt to mp->mnt_stat.f_mntonname in warnings about missing MAC and ACLs support in the kernel. If it is a first mount, fs->fs_fsmnt is empty. MFC after: 1 week
159209	03-Jun-2006	rodrigc	Check the sectorsize of the underlying disk before trying to bread() the UFS superblock. Should eliminate crashes when trying to do: mount -t ufs on an audio CD. PR: kern/85893 Reported by: Russell Francis <rfrancis at ev dot net> MFC after: 1 week
159109	31-May-2006	maxim	o Rearrange and remove incorrect comments. Requested by: bde
159102	31-May-2006	maxim	o According to POSIX, the result of ftruncate(2) is unspecified for file types other than VREG, VDIR and shared memory objects. We already handle VREG, VLNK and VDIR cases. Silently ignore truncate requests for all the rest. Adjust comments. PR: kern/98064 Submitted by: bde Security: local DoS Regress. test: regression/fifo/fifo_misc MFC after: 2 weeks
158952	26-May-2006	rodrigc	Remove "update" from ffs_opts. It has been moved to global_opts in vfs_mount.c.
158924	26-May-2006	rodrigc	Remove calls to vfs_export() for exporting a filesystem for NFS mounting from individual filesystems. Call it instead in vfs_mount.c, after we call VFS_MOUNT() for a specific filesystem.
158867	24-May-2006	rodrigc	Take errmsg out of ffs_opts. It is already part of global_opts in vfs_mount.c.
158802	21-May-2006	maxim	o Fix a comment: ufs2_dinode.di_blocks counts blocks not bytes actually held.
158801	21-May-2006	maxim	o Fix a comment: directory whiteout type is DT_WHT not DT_W.
158659	16-May-2006	trhodes	Provide a less cryptic panic message in place of just "found inode."
158636	16-May-2006	tegge	Read block hints list from last snapshot on the active snapshot list.
158634	15-May-2006	tegge	Copy last block on file system again after file system has been suspended. Obtained from: NetBSD
158633	15-May-2006	tegge	Don't leak a locked buffer if last block on file system cannot be read.
158632	15-May-2006	tegge	Errors detected while file system is suspended should not trigger an assertion failure.
158527	13-May-2006	tegge	Expunge traces of unlinked snapshot files when making a new snapshot.
158382	09-May-2006	tegge	Bring the call to softdep_releasefile() within the region protected by vn_start_secondary_write() since it might cause file system write activity (e.g. ffs_snapremove()).
158338	06-May-2006	tegge	ffs_syncvnode() might skip some of the blocks due to them being locked, assuming them to be inflight write buffers. This is not always the case. bufdaemon might hold the buffer lock and give up writing the buffer due to it having dependencies, the file system being suspended or the vnode lock being held by another thread. When bufdaemon decides to write the buffer there is still a window before bufobj_wref() has been called, allowing other threads to believe that the vnode has no dirty buffers or inflight writes. Try harder to flush first block of new subdirectory to get rid of MKDIR_BODY dependency.
158325	05-May-2006	tegge	Return error if vnode was reclaimed while it was temporarily unlocked. Add missing calls to vn_finished_write() in error handling.
158322	05-May-2006	tegge	Turn off disk quotas for snapshot files.
158321	05-May-2006	tegge	Avoid locking overhead when snapshots are disabled.
158308	05-May-2006	pjd	- Set bio_done directly to NULL to indicate that we want to wait for the bio. - Use biowait() instead of copying the code. MFC after: 1 month
158262	03-May-2006	tegge	Detect the snapshot file being prematurely unlinked.
158261	03-May-2006	tegge	Temporarily undo clusters contribution to global runningbufspace while handling copy on write for the buffers taking part in the cluster.
158260	03-May-2006	tegge	A side effect of calling runningbufwakeup() is that bp->b_runningbufspace is cleared. Save old value and restore bp->b_runningbufspace before returning from ffs_copyonwrite().
158259	02-May-2006	tegge	Close a race when VOP_LOCK() on a snapshot file is attempted at the same time as it is changed back into a normal file. The locker would get the shared "snaplk" lock which would no longer be the correct lock for the vnode.
158100	28-Apr-2006	scottl	Fix a typo.
158095	28-Apr-2006	jeff	- Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child buffers to go on the buf daemon's DIRTYGIANT queue. - Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler runs in the context of the buf daemon and may require Giant.
157955	22-Apr-2006	trhodes	Revert previous to this file before an actual request is made.
157919	21-Apr-2006	trhodes	Remove what I believe are two useless ifdefs. If a user or administrator enables multilabel, or any option for that matter, most likely they have a reason. This will allow users to see that mulilabel is enabled via an issued "mount" command and remove an annoying warning - printed only when a MAC kernel is not installed - on boot up. Discussed with: green, brueffer, Samy Al Bahra. Probably ran past: csjp (though I can't remember).
157805	17-Apr-2006	kensmith	Fix panic() message to give the right function name.
157447	03-Apr-2006	tegge	Eliminate softdep_flush() livelock by accounting for number of worklist items marked as being in progress.
157325	31-Mar-2006	jeff	- Release the references acquired by VOP_GETWRITEMOUNT and vfs_getvfs(). Discussed with: tegge Tested by: kris Sponsored by: Isilon Systems, Inc.
156899	19-Mar-2006	tegge	Allow compilation when not using softupdates.
156898	19-Mar-2006	tegge	Let snapshots make a copy of old contents for all buffers taking part in a cluster instead of just the first buffer. Delay buf_start() calls until snapshots have a copy of old content. PR: kern/93942
156897	19-Mar-2006	tegge	Add kludge to avoid deadlock when unlinking snapshot.
156896	19-Mar-2006	tegge	Reduce probability of unmount failing after having unmounted snapshots.
156895	19-Mar-2006	tegge	Ensure that vnode for directory isn't reclaimed before ffs_snapshot() has completed expunging unlinked files. It could come back at another memory location causing a lock order reversal.
156589	12-Mar-2006	jeff	- Remove the call to softdep_waitidle after suspending the filesystem. This does not do what I wanted as all dirty buffers must be flushed by the call to ffs_sync and any remaining dependency work would mean that this failed. Pointed out by: tegge
156587	12-Mar-2006	jeff	- Remove the call to softdep_waitidle after suspending the filesystem. This does not do what I wanted as all dirty buffers must be flushed by the call to ffs_sync and any remaining dependency work would mean that this failed. Pointed out by: tegge
156560	11-Mar-2006	tegge	Block secondary writes while expunging active unlinked files. Fix detection of active unlinked files by checking VI_OWEINACT and VI_DOINGINACT in addition to v_usecount. Defer inactive handling for unlinked files if the file system is mostly suspended (secondary writes being blocked). Perform deferred inactive handling after the file system is resumed.
156521	10-Mar-2006	tegge	Remove unneeded (and broken) usage of MNT_REF()/MNT_REL().
156451	08-Mar-2006	tegge	Use vn_start_secondary_write() and vn_finished_secondary_write() as a replacement for vn_write_suspend_wait() to better account for secondary write processing. Close race where secondary writes could be started after ffs_sync() returned but before the file system was marked as suspended. Detect if secondary writes or softdep processing occurred during vnode sync loop in ffs_sync() and retry the loop if needed.
156418	08-Mar-2006	tegge	Don't set IN_CHANGE and IN_UPDATE on inodes for potentially suspended file systems. This could cause deadlocks when creating snapshots. Reviewed by: jeff
156225	02-Mar-2006	tegge	Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must be called without any vnode locks held. Remove calls to vn_start_write() and vn_finished_write() in vnode_pager_putpages() and add these calls before the vnode lock is obtained to most of the callers that don't already have them.
156206	02-Mar-2006	jeff	- Acquire lk in softdep_slowdown so that it's owned when we call softdep_speedup(). - Assert that lk is held in softdep_speedup() rather than acquiring it. This avoids a potential lock recursion.
156203	02-Mar-2006	jeff	- Move softdep from using a global worklist to per-mount worklists. This has many positive effects including improved smp locking, reducing interdependencies between mounts that can lead to deadlocks, etc. - Add the softdep worklist and various counters to the ufsmnt structure. - Add a mount pointer to the workitem and remove mount pointers from the various structures derived from the workitem as they are now redundant. - Remove the poor-man's semaphore protecting softdep_process_worklist and softdep_flushworklist. Several threads may now process the list simultaneously. - Add softdep_waitidle() to block the thread until all pending dependencies being operated on by other threads have been flushed. - Use softdep_waitidle() in unmount and snapshots to block either operation until the fs is stable. - Remove softdep worklist processing from the syncer and move it into the softdep_flush() thread. This thread processes all softdep mounts once each second and when it is called via the new softdep_speedup() when there is a resource shortage. This removes the softdep hook from the kernel and various hacks in header files to support it. Reviewed by/Discussed with: tegge, truckman, mckusick Tested by: kris
155897	22-Feb-2006	jeff	- Using LK_NOWAIT in qsync() can get us into infinite loop situations that lead to deadlocks. Remove it. MFC After: 1 week
155572	12-Feb-2006	rwatson	In quotaoff(), lock the vnode instead of asserting it when manipulating v_vflags. MFC after: 1 week Submitted by: Antoine Brodin <antoine at brodin at laposte dot net>
155555	11-Feb-2006	rwatson	Instead of asserting the vnode lock before manipulating v_vflag, acquire it and drop it afterwards. Found by: kris MFC after: 1 week
155160	01-Feb-2006	jeff	- Reorder calls to vrele() after calls to vput() when the vrele is a directory. vrele() may lock the passed vnode, which in these cases would give an invalid lock order of child -> parent. These situations are deadlock prone although do not typically deadlock because the vrele is typically not releasing the last reference to the vnode. Users of vrele must consider it as a call to vn_lock() and order it appropriately. MFC After: 1 week Sponsored by: Isilon Systems, Inc. Tested by: kkenn
154152	09-Jan-2006	tegge	Add marker vnodes to ensure that all vnodes associated with the mount point are iterated over when using MNT_VNODE_FOREACH. Reviewed by: truckman
154150	09-Jan-2006	tegge	If the lock passed to getdirtybuf() is the softdep lock then the background write completed wakeup could be missed. Close the race by grabbing the lock normally used for protection of bp->b_xflags. Reviewed by: truckman
154149	09-Jan-2006	tegge	Broaden scope of softdep_worklist_busy rwlock protection of softdep processing to avoid some dependencies being missed by softdep_flushworklist(). Reviewed by: truckman
154065	06-Jan-2006	imp	New option: NO_FFS_SNAPSHOT. I did this in p4 about the same time that NetBSD implemented it independently of them (don't know which one was actually first). This saves about 24k for those times you don't need snapshot support (like when running off a ram disk, or in an embedded environment where size matters).
153689	23-Dec-2005	delphij	Typo.
153400	14-Dec-2005	des	Eradicate caddr_t from the VFS API.
152771	24-Nov-2005	rodrigc	Fix parsing of atime, clusterr, clusterw, exec, suid, symfollow mount options. Noticed by: Amir Shalem < amir at boom dot org dot il>
152639	20-Nov-2005	rodrigc	If export mount flag is not passed in, set default parameters for export structure and pass that to vfs_export(). Currently in userland mount(8), an export structure is unconditionally passed in, only for UFS. This is an attempt to move that UFS-specific behavior out of mount(8) and into the UFS filesystem code.
152622	19-Nov-2005	rodrigc	Add more options to ffs_opts, so that vfs_filteropts() will not complain when we pass these options to a UFS filesystem as strings via nmount(): noexec, nosuid, nosymfollow, sync, suiddir
152567	18-Nov-2005	rodrigc	- Add parsing for the following existing UFS/FFS mount options in the nmount() callpath via vfs_getopt(), and set the appropriate MNT_* flag: -> acls, async, force, multilabel, noasync, noatime, -> noclusterr, noclusterw, snapshot, update - Allow errmsg as a valid mount option via vfs_getopt(), so we can later add a hook to propagate mount errors back to userspace via vfs_mount_error().
152163	07-Nov-2005	delphij	Slightly reorganize to reduce duplicated code. Reviewed by: rwatson
151906	31-Oct-2005	ps	Rate limit filesystem full and out of inodes messages to once a second.
151897	31-Oct-2005	rwatson	Normalize a significant number of kernel malloc type names: - Prefer '_' to ' ', as it results in more easily parsed results in memory monitoring tools such as vmstat. - Remove punctuation that is incompatible with using memory type names as file names, such as '/' characters. - Disambiguate some collisions by adding subsystem prefixes to some memory types. - Generally prefer lower case to upper case. - If the same type is defined in multiple architecture directories, attempt to use the same name in additional cases. Not all instances were caught in this change, so more work is required to finish this conversion. Similar changes are required for UMA zone names.
151657	25-Oct-2005	delphij	Remove an unneeded "a" from comment.
151528	21-Oct-2005	njl	Adjust maxfilesize for UFS1 and old 4.4 FFS. For UFS1, increase the limit to (max block - 1) * bsize. For DEV_BSIZE, this doubles the limit from 0.5 TB to 1 TB. For the old 4.4 FFS case, decrease the limit from 0.5 TB to 2 GB - 1. Older systems had a 32 bit off_t so they couldn't access the larger files anyway. Collaboration with: bde
151390	16-Oct-2005	truckman	Correct the type of the temporary variable used by ufs_lookup.c:1.78 to fix the race condition in the ufs_lookup() ISDOTDOT code. Noticed by: bde MFC after: 12 days
151347	14-Oct-2005	truckman	Close a race in the ufs_lookup() code that handles the ISDOTDOT case by saving the value of dp->i_ino before unlocking the vnode for the current directory and passing the saved value to VFS_VGET(). Without this change, another thread can overwrite dp->i_ino after the current directory is unlocked, causing ufs_lookup() to lock and return the wrong vnode in place of the vnode for its parent directory. A deadlock can occur if dp->i_ino was changed to a subdirectory of the current directory because the root to leaf vnode lock ordering will be violated. A vnode lock can be leaked if dp->i_ino was changed to point to the current directory, which causes the current vnode lock for the current directory to be recursed, which confuses lookup() into calling vrele() when it should be calling vput(). The probability of this bug being triggered seems to be quite low unless the sysctl variable debug.vfscache is set to 0. Reviewed by: jhb MFC after: 2 weeks
151258	12-Oct-2005	rwatson	When performing a VOP_LOOKUP() as part of UFS1 extended attribute auto-start, set cnp.cn_lkflags to LK_EXCLUSIVE. This flag must now be set so that lockmgr knows what kind of lock to acquire, and it will panic if not specified. This resulted in a panic when using extended attributes on UFS1 as of locking work present in the 6.x branch. This is a RELENG_6_0 merge candidate. Reported by: lofi MFC after: 3 days
151252	12-Oct-2005	dds	Move execve's access time update functionality into a new vfs_mark_atime() function, and use the new function for performing efficient atime updates in mmap(). Reviewed by: bde MFC after: 2 weeks
151218	10-Oct-2005	tegge	Avoid unintended VMIO on directories and symlinks due to leftover object not having been destroyed.
151184	09-Oct-2005	tegge	Adjust totread argument passed to cluster_read() to account for offset not being block aligned.
151181	09-Oct-2005	tegge	Don't pretend that a failed sync write was succesful.
151180	09-Oct-2005	tegge	Reduce probability for a deadlock that can occur when a snapshot inode is updated by a process holding the snapshot lock. Another process updating a different inode in the same inodeblock will do copy on write checks and lock in the opposite direction. The snapshot code force a copy on write of these blocks manually (cf. start of expunge_ufs[12]) and these inode blocks are later put on snapblklist. This partial fix is to 'drain' the relevant ffs_copyonwrite() operation after installing new snapblklist. This is not a 100% solution since a failed block allocation can cause implicit fsync() which might deadlock before the new snapblklist has been installed.
151179	09-Oct-2005	tegge	Eliminate a deadlock that can occur when a dirty block belonging to a snapshot file is flushed by a process not holding snaplk (e.g. bufdaemon). Another process might hold snaplk and try to access the block due to ffs_copyonwrite processing.
151178	09-Oct-2005	tegge	Eliminate a deadlock that can occur during the cgaccount() processing due to the cg map buffer being held when writing indirect blocks. The process ends up in ffs_copyonwrite(), attempting to get snaplk while holding the cg map buffer lock. Another process might be in ffs_copyonwrite(), trying to allocate a new block for a copy. It would hold snaplk while trying to get the cg map buffer lock. Release the cg map buffer early and use the copy for most of the cgaccount processing to avoid this deadlock.
151177	09-Oct-2005	tegge	Reduce the probability of low block numbers passed to ffs_snapblkfree() by skipping the call from ffs_snapremove() if the block number is zero. Simplify snapshot locking in ffs_copyonwrite() and ffs_snapblkfree() by using the same locking protocol for low block numbers as for larger block numbers. This removes a lock leak that could happen if vn_lock() succeeded after lockmgr() failed in ffs_snapblkfree(). Check if snapshot is gone before retrying a lock in ffs_copyonwrite().
151176	09-Oct-2005	tegge	Reinitialize v_type and v_op fields in case vnode has been reused without reclamation. If the vnode previously was a fifo then v_op would point to ffs_fifoops[12] instead of the expected ffs_vnodeops[12], causing a panic at the end of ffsext_strategy.
150891	03-Oct-2005	truckman	Initialize the inode i_flag field in ffs_valloc() to clean up any stale flag bits left over from before the inode was recycled. Without this change, a leftover IN_SPACECOUNTED flag could prevent softdep_freefile() and softdep_releasefile() from incrementing fs_pendinginodes. Because handle_workitem_freefile() unconditionally decrements fs_pendinginodes, a negative value could be reported at file system unmount time with a message like: unmount pending error: blocks 0 files -3 The pending block count in fs_pendingblocks could also be negative for similar reasons. These errors can cause the data returned by statfs() to be slightly incorrect. Some other cleanup code in softdep_releasefile() could also be incorrectly bypassed. MFC after: 3 days
150791	01-Oct-2005	truckman	Correct previous commit to fix the sense of the TDP_NORUNNINGBUF check in ffs_copyonwrite() that is a precondition for calling waitrunningbufspace(). Pointed out by: tegge Pointy hat to: truckman MFC after: 3 days
150760	30-Sep-2005	truckman	Un-staticize waitrunningbufspace() and call it before returning from ffs_copyonwrite() if any async writes were launched. Restore the threads previous TDP_NORUNNINGBUF state before returning from ffs_copyonwrite().
150741	30-Sep-2005	truckman	Un-staticize runningbufwakeup() and staticize updateproc. Add a new private thread flag to indicate that the thread should not sleep if runningbufspace is too large. Set this flag on the bufdaemon and syncer threads so that they skip the waitrunningbufspace() call in bufwrite() rather than than checking the proc pointer vs. the known proc pointers for these two threads. A way of preventing these threads from being starved for I/O but still placing limits on their outstanding I/O would be desirable. Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from blocking on the runningbufspace check while holding snaplk. This prevents snaplk from being held for an arbitrarily long period of time if runningbufspace is high and greatly reduces the contention for snaplk. The disadvantage is that ffs_copyonwrite() can start a large amount of I/O if there are a large number of snapshots, which could cause a deadlock in other parts of the code. Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace before attempting to grab snaplk so that I/O requests waiting on snaplk are not counted in runningbufspace as being in-progress. Increment runningbufspace again before actually launching the original I/O request. Prior to the above two changes, the system could deadlock if enough I/O requests were blocked by snaplk to prevent runningbufspace from falling below lorunningspace and one of the bawrite() calls in ffs_copyonwrite() blocked in waitrunningbufspace() while holding snaplk. See <http://www.holm.cc/stress/log/cons143.html>
150733	29-Sep-2005	truckman	After a rmdir()ed directory has been truncated, force an update of the directory's inode after queuing the dirrem that will decrement the parent directory's link count. This will force the update of the parent directory's actual link to actually be scheduled. Without this change the parent directory's actual link count would not be updated until ufs_inactive() cleared the inode of the newly removed directory, which might be deferred indefinitely. ufs_inactive() will not be called as long as any process holds a reference to the removed directory, and ufs_inactive() will not clear the inode if the link count is non-zero, which could be the result of an earlier system crash. If a background fsck is run before the update of the parent directory's actual link count has been performed, or at least scheduled by putting the dirrem on the leaf directory's inodedep id_bufwait list, fsck will corrupt the file system by decrementing the parent directory's effective link count, which was previously correct because it already took the removal of the leaf directory into account, and setting the actual link count to the same value as the effective link count after the dangling, removed, leaf directory has been removed. This happens because fsck acts based on the actual link count, which will be too high when fsck creates the file system snapshot that it references. This change has the fortunate side effect of more quickly cleaning up the large number dirrem structures that linger for an extended time after the removal of a large directory tree. It also fixes a potential problem with the shutdown of the syncer thread timing out if the system is rebooted immediately after removing a large directory tree. Submitted by: tegge MFC after: 3 days
150663	28-Sep-2005	rwatson	Back out alpha/alpha/trap.c:1.124, osf1_ioctl.c:1.14, osf1_misc.c:1.57, osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60, svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81, svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55, svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10, ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58, unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133: Now that Giant is acquired in uprintf() and tprintf(), the caller no longer leads to acquire Giant unless it also holds another mutex that would generate a lock order reversal when calling into these functions. Specifically not backed out is the acquisition of Giant in nfs_socket.c and rpcclnt.c, where local mutexes are held and would otherwise violate the lock order with Giant. This aligns this code more with the eventual locking of ttys. Suggested by: bde
150634	27-Sep-2005	jhb	Use the refcount API to manage the reference count for user credentials rather than using pool mutexes. Tested on: i386, alpha, sparc64
150492	23-Sep-2005	delphij	Restore a historical ufs_inactive behavior that has been changed in rev. 1.40 of ufs_inode.c, which allows an inode being truncated even when the filesystem itself is marked RDONLY. A subsequent call of UFS_TRUNCATE (ffs_truncate) would panic the system as it asserts that it can only be called when the filesystem is mounted read-write (same changeset, rev. 1.74 of sys/ufs/ffs/ffs_inode.c). Because ffs_mount() already takes care of sync'ing the filesystem to disk before being downgraded to readonly, it appears to be more desirable that we should not permit this sort of writes to disk. This change would fix a panic that occours when read-only mounted a corrupted filesystem and doing some file operations. MT6/5/4 candidate Reviewed by: mckusick
150335	19-Sep-2005	rwatson	Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(), as they both interact with the tty code (!MPSAFE) and may sleep if the tty buffer is full (per comment). Modify all consumers of uprintf() and tprintf() to hold Giant around calls into these functions. In most cases, this means adding an acquisition of Giant immediately around the function. In some cases (nfs_timer()), it means acquiring Giant higher up in the callout. With these changes, UFS no longer panics on SMP when either blocks are exhausted or inodes are exhausted under load due to races in the tty code when running without Giant. NB: Some reduction in calls to uprintf() in the svr4 code is probably desirable. NB: In the case of nfs_timer(), calling uprintf() while holding a mutex, or even in a callout at all, is a bad idea, and will generate warnings and potential upset. This needs to be fixed, but was a problem before this change. NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having non-MPSAFE tty code. MFC after: 1 week
150010	12-Sep-2005	tegge	Giant is no longer needed here.
149811	06-Sep-2005	csjp	Convert the primary ACL allocator from malloc(9) to using a UMA zone instead. Also introduce an aclinit function which will be used to create the UMA zone for use by file systems at system start up. MFC after: 1 month Discussed with: rwatson
149808	05-Sep-2005	tegge	Retain generation count when writing zeroes instead of an inode to disk. Don't free a struct inodedep if another process is allocating saved inode memory for the same struct inodedep in initiate_write_inodeblock_ufs[12](). Handle disappearing dependencies in softdep_disk_io_initiation(). Reviewed by: mckusick
149713	02-Sep-2005	ssouhlal	ffs_mountfs() needs devvp to be locked, so lock it. Glanced at by: phk Tested by: pjd MFC after: 3 days
149358	21-Aug-2005	ssouhlal	Set the mountpoint path in the superblock (fs_fsmnt) at mount-time so that it appears in the various messages (not cleanly unmounted, filesystem full, etc). This has been broken since rev 1.261.
149354	21-Aug-2005	tegge	Don't set the COMPLETE flag in an inodedep structure before the related inode has been written.
149178	17-Aug-2005	iedowse	In the ufsdirhash_build() failure case for corrupted directories or unreadable blocks, make sure to destroy the mutex we created. Also fix an unrelated typo in a comment. Found by: Peter Holm's stress tests Reviewed by: dwmalone MFC after: 3 days
148608	31-Jul-2005	ups	Delay freeing disk space for file system blocks until all dirty buffers are safely released. This fixes softdep problems on truncation (deletion) of files with dirty buffers. Reviewed by: jeff@, mckusick@, ps@, tegge@ Tested by: glebius@, ps@ MFC after: 3 weeks
148200	20-Jul-2005	alc	Eliminate inconsistency in the setting of the B_DONE flag. Specifically, make the b_iodone callback responsible for setting it if it is needed. Previously, it was set unconditionally by bufdone() without holding whichever lock is shared by the b_iodone callback and the corresponding top-half function. Consequently, in a race, the top-half function could conclude that operation was done before the b_iodone callback finished. See, for example, aio_physwakeup() and aio_fphysio(). Note: I don't believe that the other, more widely-used b_iodone callbacks are affected. Discussed with: jeff Reviewed by: phk MFC after: 2 weeks
147198	09-Jun-2005	ssouhlal	Allow EVFILT_VNODE events to work on every filesystem type, not just UFS by: - Making the pre and post hooks for the VOP functions work even when DEBUG_VFS_LOCKS is not defined. - Moving the KNOTE activations into the corresponding VOP hooks. - Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct mount that permits filesystems to disable the new behavior. - Creating a default VOP_KQFILTER function: vfs_kqfilter() My benchmarks have not revealed any performance degradation. Reviewed by: jeff, bde Approved by: rwatson, jmg (kqueue changes), grehan (mentor)
146829	31-May-2005	kensmith	This patch addresses a standards violation issue. The standards say a file's access time should be updated when it gets executed. A while ago the mechanism used to exec was changed to use a more mmap based mechanism and this behavior was broken as a side-effect of that. A new vnode flag is added that gets set when the file gets executed, and the VOP_SETATTR() vnode operation gets called. The underlying filesystem is expected to handle it based on its own semantics, some filesystems don't support access time at all. Those that do should handle it in a way that does not block, does not generate I/O if possible, etc. In particular vn_start_write() has not been called. The UFS code handles it the same way as it would normally handle the access time if a file was read - the IN_ACCESS flag gets set in the inode but no other action happens at this point. The actual time update will happen later during a sync (which handles all the necessary locking). Got me into this: cperciva Discussed with: a lot with bde, a little with kan Showed patches to: phk, jeffr, standards@, arch@ Minor discussion on: arch@
146802	30-May-2005	jeff	- Don't set our bio op to be a READ when we've just completed a write. There are subtle differences in the read and write completion path. Instead, grab an extra write ref so the write path can drop it when we recursively call bufdone(). I believe this may be the source of the wrong bufobj panics. Reported by: pho, kkenn
146356	18-May-2005	mckusick	Allow removal of empty directories with high link counts. These can occur on a filesystem running with soft updates after a crash and before a background fsck has been run. To prevent discrepancies from arising in a background fsck that may already be running, the directory is removed but its inode is not freed and is left with the residual reference count. When encountered by the background fsck it will be reclaimed.
145824	03-May-2005	jeff	- Don't restrict the softdep stats to DEBUG kernels, they cost nothing to export. This was happening anyway since this file manually sets DEBUG. - Add a sysctl for the number of items on the worklist. - Use a more canonical loop restart in softdep_fsync_mountdev, it saves some code at the expense of a goto and makes me worry less about modifying a variable that should be private to the TAILQ_FOREACH_SAFE macro.
145702	30-Apr-2005	jeff	- Use bdone() directly instead of calling it indirectly through ffs_rawreaddone(). Sponsored by: Isilon Systems, Inc.
145138	16-Apr-2005	pjd	- Plug memory leak. - Fix two style nits. Found by: Coverity Prevent analysis tool Reviewed by: rwatson MFC after: 1 week
145006	13-Apr-2005	jeff	- Change all filesystems and vfs_cache to relock the dvp once the child is locked in the ISDOTDOT case. Se vfs_lookup.c r1.79 for details. Sponsored by: Isilon Systems, Inc.
144659	05-Apr-2005	jeff	- Consistently call 'vp' vp rather than ovp sometimes in ffs_truncate(). Do the same for oip. Pointed out by: glebius
144590	03-Apr-2005	jeff	- Use M_ZERO rather than explicitly calling bzero(). - Don't intermingle direct calls to lockmgr and indirect calls through VOPs. This will be important in the future. - Dont lock the devvp's interlock just to release it on the next line by passing LK_INTERLOCK to lockmgr. - Restructure ffs_snapshot_unmount so we don't call free() with the devvp's interlock locked.
144586	03-Apr-2005	jeff	- In ffs_sync we need to pass LK_SLEEPFAIL in when we lock the vnode because it may change identities while we're sleeping on the lock. Otherwise we may bail out of ffs_sync() early due to an error from deadfs. - Collapse a VOP_UNLOCK, vrele into a single vput().
144585	03-Apr-2005	jeff	- Move the contents of softdep_disk_prewrite into ffs_geom_strategy to fix two bugs. - ffs_disk_prewrite was pulling the vp from the buf and checking for COPYONWRITE, when really it wanted the vp from the bufobj that we're writing to, which is the devvp. This lead to us skipping the copy on write to all file data, which significantly broke snapshots for the last few months. - When the SOFTUPDATES option was not included in the kernel config we would also skip the copy on write check, which would effectively disable snapshots. - Remove an invalid mp_fixme(). Debugging tips from: mckusick Reported by: iedowse, others Discussed with: phk
144376	31-Mar-2005	jeff	- Fix botched LK_NOWAIT removal. I mistakenly thought this compiled as part of GENERIC.
144375	31-Mar-2005	jeff	- FFS supports shared locks, clear LK_NOSHARE from our vnode locks. Sponsored by: Isilon Systems, Inc.
144373	31-Mar-2005	jeff	- Set LK_NOSHARE for snapshot locks. snapshots require exclusive only access. - Remove the hack from ffs_lock() to implement LK_NOSHARE in a ffs specific way. Sponsored by: Isilon Systems, Inc.
144367	31-Mar-2005	jeff	- LK_NOPAUSE is a nop now. Sponsored by: Isilon Systems, Inc.
144300	29-Mar-2005	jeff	- Remove wantparent, it is no longer necessary. An assert in vfs_lookup.c prevents any callers from doing a modifying op without LOCKPARENT or WANTPARENT. It wasn't even properly used in the CREATE or DELETE cases.
144289	29-Mar-2005	jeff	- Upgrade a shared lock request to exclusive in ffs_vget() if we have to create the vnode. Sponsored by: Isilon Systems, Inc.
144288	29-Mar-2005	jeff	- Honor the cn_lkflags passed from namei() when locking the leaf. Sponsored by: Isilon Systems, Inc.
144209	28-Mar-2005	jeff	- UFS no longer uses PDIRUNLOCK to track the parent state. Instead, we now rely on ufs to always leave the parent locked except in the ISDOTDOT case. Adjust asserts to deal with these changes. Sponsored by: Isilon Systems, Inc.
144208	28-Mar-2005	jeff	- We no longer have to bother with PDIRUNLOCK, lookup() handles it for us. Sponsored by: Isilon Systems, Inc.
144118	25-Mar-2005	das	When the softupdates worklist gets too long, threads that attempt to add more work are forced to process two worklist items first. However, processing an item may generate additional work, causing the unlucky thread to recursively process the worklist. Add a per-thread flag to detect this situation and avoid the recursion. This should fix the stack overflows that could occur while removing large directory trees. Tested by: kris Reviewed by: mckusick
144057	24-Mar-2005	jeff	- Call VFS_ROOT() with LK_EXCLUSIVE. Sponsored by: Isilon Systems, Inc.
144056	24-Mar-2005	jeff	- Update the ufs_root() prototype. - Pass the ufs_root() flags argument to VFS_VGET() to allow callers to specify shared locks. Sponsored by: Isilon Systems, Inc.
143743	17-Mar-2005	jeff	- Lock the clearing of v_data in ufs_reclaim() to prevent a pagefault in ffs_lock() when it acesses v_data without the vnlock. Sponsored by: Isilon Systems, Inc.
143692	16-Mar-2005	phk	Add two arguments to the vfs_hash() KPI so that filesystems which do not have unique hashes (NFS) can also use it.
143666	15-Mar-2005	phk	Don't hold a reference on the disk vnode for each inode.
143663	15-Mar-2005	phk	Improve the vfs_hash() API: vput() the unneeded vnode centrally to avoid replicating the vput in all the filesystems.
143619	15-Mar-2005	phk	Simplify the vfs_hash calling convention.
143613	15-Mar-2005	jeff	- Destroy the vnode object earlier in VOP_RECLAIM as we need more of the vnode valid before the vm flushes pages. - Get rid of some extraneous uses of the vnode interlock. Sponsored by: Isilon Systems, Inc.
143562	14-Mar-2005	phk	Use vfs_hash instead of home-rolled.
143504	13-Mar-2005	jeff	- It is not legal to access v_data without the vnode lock or interlock held. Grab the vnode interlock if LK_INTERLOCK has not been passed in so that we can inspect v_data in ffs_lock(). Sponsored by: Isilon Systems, Inc.
143503	13-Mar-2005	jeff	- The VI_DOOMED flag now signals the end of a vnode's relationship with the filesystem. Check that rather than VI_XLOCK. - Shorten ffs_reload by one step. The old check for an inactive vnode was slightly racey, and the code which deals with still active vnodes is not much more expensive. Sponsored by: Isilon Systems, Inc.
143502	13-Mar-2005	jeff	- The VI_DOOMED flag now signals the end of a vnode's relationship with the filesystem. Check that rather than VI_XLOCK. Sponsored by: Isilon Systems, Inc.
143501	13-Mar-2005	jeff	- Fix an assert now that the XLOCK no longer exists. Sponsored by: Isilon Systems, Inc.
143500	13-Mar-2005	jeff	- In ufs_mknod(), hold the lock across the call to vgone() as that is now required. - In ufs_close(), don't do the EAGAIN vrele hack, the top layer now calls vn_start_write before the lock is acquired as it should. Sponsored by: Isilon Systems, Inc.
143499	13-Mar-2005	jeff	- Don't drop the lock in ufs_inactive(). - Also in ufs_inactive, don't acquire the vnode interlock where it isn't strictly needed. Also owning the vnode interlock while calling vprint() will cause locking assertions to trip. Sponsored by: Isilon Systems, Inc.
142879	01-Mar-2005	jeff	- Fix anoter dyslexic moment; an atomic_set_int should've become ACTIVESET, not ACTIVECLEAR. Submitted by: iedowse
142692	27-Feb-2005	phk	Remove debug printout of major/minor numbers, print name instead.
142682	27-Feb-2005	sam	use uiomove return value instead of always returning 0 when doing a readlink of a fast link Noticed by: Coverity Prevent analysis tool Reviewed by: phk
142263	22-Feb-2005	jeff	- Add VOP locking asserts in several functions that have been implicated in recent deadlocks.
142123	20-Feb-2005	delphij	The recomputation of file system summary at mount time can be a very slow process, especially for large file systems that is just recovered from a crash. Since the summary is already re-sync'ed every 30 second, we will not lag behind too much after a crash. With this consideration in mind, it is more reasonable to transfer the responsibility to background fsck, to reduce the delay after a crash. Add a new sysctl variable, vfs.ffs.compute_summary_at_mount, to control this behavior. When set to nonzero, we will get the "old" behavior, that the summary is computed immediately at mount time. Add five new sysctl variables to adjust ndir, nbfree, nifree, nffree and numclusters respectively. Teach fsck_ffs about these API, however, intentionally not to check the existence, since kernels without these sysctls must have recomputed the summary and hence no adjustments are necessary. This change has eliminated the usual tens of minutes of delay of mounting large dirty volumes. Reviewed by: mckusick MFC After: 1 week
142079	19-Feb-2005	phk	Try to unbreak the vnode locking around vop_reclaim() (based mostly on patch from kan@). Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on close. This is not yet a generally safe function, but for this very specific use it is safe. This solves the problem with buffers not being flushed by unmount or after failed mount attempts.
142074	19-Feb-2005	delphij	When clearing a fragment, it's possible that the length is zero. Reviewed by: mckusick MFC After: 1 week
141927	14-Feb-2005	jeff	- Remove the unused and unsafe ufs_ihashlookup. This function returned a vnode pointer that could not be used since no locks were held. Sponsored by: Isilon Systems, Inc.
141685	11-Feb-2005	phk	Make non-SOFTUPDATES kernels compile again. Integrate the stubfile into the main file now that license issues have been long resolved.
141631	10-Feb-2005	phk	Make a some SYSCTL_NODEs and some of FFS's VFS_ methods static.
141595	09-Feb-2005	jeff	- In the softupdates case for ffs_truncate() we use vinvalbuf() to invalidate pending io and dependencies. However, vinvalbuf() rightfully does not call vnode_pager_setsize() for us. We must do this here. This could potentially have caused numerous kinds of bugs, but it was specifically causing msync() deadlocks because msync() was writing flushing pages that should not have been valid. Sponsored by: Isilon Systems, Inc. Reported by: kkenn
141570	09-Feb-2005	phk	style polishing.
141543	08-Feb-2005	cperciva	Add a new sysctl, "security.jail.chflags_allowed", which controls the behaviour of chflags within a jail. If set to 0 (the default), then a jailed root user is treated as an unprivileged user; if set to 1, then a jailed root user is treated the same as an unjailed root user. This is necessary to allow "make installworld" to work inside a jail, since it attempts to manipulate the system immutable flag on certain files. Discussed with: csjp, rwatson MFC after: 2 weeks
141542	08-Feb-2005	phk	Split the vop_vector for ffs1 and ffs2, this is mostly for the different EXTATTR support.
141541	08-Feb-2005	phk	Use ffs_truncate() directly instead of UFS_TRUNCATE()
141539	08-Feb-2005	phk	Background writes are entirely an FFS/Softupdates thing. Give FFS vnodes a specific bufwrite method which contains all the background write stuff and then calls into the default bufwrite() for the rest of the job. Remove all the background write related stuff from the normal bufwrite. This drags the softdep_move_dependencies() back into FFS. Long term, it is worth looking at simply copying the data into allocated memory and issuing the bio directly and not create the "shadow buf" in the first place (just like copy-on-write is done in snapshots for instance). I don't think we really gain anything but complexity from doing this with a buf.
141533	08-Feb-2005	phk	Drag another softupdates tentacle back into FFS: Now that FFS's vop_fsync is separate from the internal use we can do the full job there.
141526	08-Feb-2005	phk	Don't use the UFS_* and VFS_* functions where a direct call is possble. The UFS_ functions are for UFS to call back into VFS. The VFS functions are external entry points into the filesystem.
141523	08-Feb-2005	rwatson	Don't use VOP_LEASE() with operations on extended attribute backing files. Pointed out by: phk
141522	08-Feb-2005	phk	For snapshots we need all VOP_LOCKs to be exclusive. The "business class upgrade" was implemented in UFS's VOP_LOCK implementation ufs_lock() which is the wrong layer, so move it to ffs_lock(). Also, as long as we have not abandonned advanced vfs-stacking we should not preclude it from happening: instead of implementing a copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at vop_stdlock() at the bottom.
141521	08-Feb-2005	phk	For snapshots we need all VOP_LOCKs to be exclusive. The "business class upgrade" was implemented in UFS's VOP_LOCK implementation ufs_lock() which is the wrong layer, so move it to ffs_lock(). Also, as long as we have not abandonned advanced vfs-stacking we should not preclude it from happening: instead of implementing a copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at vop_stdlock() at the bottom.
141520	08-Feb-2005	phk	Use VOP_STRATEGY_APV() instead of direct dereference, this is more correct.
141150	02-Feb-2005	jeff	- Use a seperate malloc tag for saved inode contents to help in debugging memory modified after free errors. Sponsored by: Isilon Systems, Inc.
141143	02-Feb-2005	kensmith	Back out previous commit, bde@ provided an example of something this breaks.
141130	02-Feb-2005	kensmith	It was noticed that we do not change a file's access time when it gets executed. This appears to violate most of the UNIX-ish standards. One example quote from: http://www.opengroup.org/onlinepubs/009695399/functions/exec.html Upon successful completion, the exec functions shall mark for update the st_atime field of the file. If an exec function failed but was able to locate the process image file, whether the st_atime field is marked for update is unspecified. Should the exec function succeed, the process image file shall be considered to have been opened with open(). This appears to take care of it for ufs filesystems, doing the necessary sanity checks (read-only filesystem, etc) without violating any other standards (setting atime for any open appears to be allowed in any standards I could find). Noticed by: cperciva Reviewed by: kan, rwatson
141085	31-Jan-2005	imp	nit in /*-
140962	29-Jan-2005	peadar	Tell vnode_create_vobject() how big an object to create, rather than having it work it out via the more expensive VOP_GETATTR Reviewed by: phk@
140939	28-Jan-2005	phk	Make filesystems get rid of their own vnodes vnode_pager object in VOP_RECLAIM().
140936	28-Jan-2005	phk	Remove unused argument to vrecycle()
140822	25-Jan-2005	phk	Introduce and use g_vfs_close().
140782	25-Jan-2005	phk	Don't use VOP_GETVOBJECT, use vp->v_object directly.
140778	24-Jan-2005	phk	Create a vnode object when the file is opened. Trust that we did so.
140774	24-Jan-2005	phk	Don't create vnode_pager objects for the disk device. geom_vfs will do that.
140768	24-Jan-2005	phk	Create a vp->v_object in VFS_FHTOVP() if we want to be exportable with NFS. We are moving responsibility for creating the vnode_pager object into the filesystems which own the vnode, and this is one of the places we have to cover. We call vnode_create_vobject() directly because we own the vnode. If we can get the size easily, pass it as an argument to save the call to VOP_GETATTR() in vnode_create_vobject()
140729	24-Jan-2005	phk	Polish style.
140709	24-Jan-2005	jeff	- Convert the global LK lock to a mutex. - Expand the scope of lk to cover not only interrupt races, but also top-half races, which includes many new uses over global top-half only data. - Get rid of interlocked_sleep() and use msleep or BUF_LOCK where appropriate. - Use the lk mutex in place of the various hand rolled semaphores. - Stop dropping the lk lock before we panic. - Fix getdirtybuf() callers so that they reacquire access to whatever softdep datastructure they were inxpecting in the failure/retry case. Previously, sleeps in getdirtybuf() could leave us with pointers to bad memory. - Update handling of ffs to be compatible with ffs locking changes. Sponsored By: Isilon Systems, Inc.
140708	24-Jan-2005	jeff	- Initialize and destroy the per-filesystem ufs lock where appropriate. - Use the buffer lock on the superblock buf to serialize calls to sbupdate. - Set the MNTK_MPSAFE flag when QUOTA is not defined in the kernel. Sponsored By: Isilon Systems, Inc.
140707	24-Jan-2005	jeff	- Remove GIANT_REQUIRED where giant is no longer required. Sponsored By: Isilon Systems, Inc.
140706	24-Jan-2005	jeff	- Use the ufs lock to protect fs_active. Sponsored By: Isilon Systems, Inc.
140705	24-Jan-2005	jeff	- Acquire the ufs lock around several ffs_alloc functions that require it. Sponsored By: Isilon Systems, Inc.
140704	24-Jan-2005	jeff	- Don't use atomic operations to deal with the active array, instead it is now quite naturally protected by the ufsmount mutex. - Use the ufs lock to protect various fields in struct fs, primarily the cg summary needs protection to avoid allocation races. Several functions have been slightly re-arranged to reduce the number of lock operations. - Adjust several functions (blkfree, freefile, etc.) to accept a ufsmount as an argument so that we may access the ufs lock. Sponsored By: Isilon Systems, Inc.
140703	24-Jan-2005	jeff	- Acquire the ufs lock when manipulating some fields of struct fs. - Change arguments to various ffs functions to match their new prototypes. Sponsored By: Isilon Systems, Inc.
140702	24-Jan-2005	jeff	- Mark the struct fs members that require the ufsmount mutex. - Define some macros for manipulating the fs_active bitmap. Sponsored By: Isilon Systems, Inc.
140701	24-Jan-2005	jeff	- Change some function parameters so that the ufsmount structure is accessable in places where the ufs lock will be needed. Sponsored By: Isilon Systems, Inc.
140700	24-Jan-2005	jeff	- Add a mutex to the ufsmount structure. This mutex is used to protect any per-instance global data that is not already protected by a buf or vnode lock. Presently, only fields in ffs's struct fs utilize this lock. - Sort some ufsmount members so that fields used for quotas are grouped together. This is in anticipation of quota locking. Sponsored By: Isilon Systems, Inc.
140306	15-Jan-2005	pjd	Fix ACLs handling for the root file system. Without this fix, when ACLs are set via tunefs(8) on the root file system, they are removed on boot when 'mount -a' is called, because mount(8) called for the root file system always add MNT_UPDATE flag and MNT_UPDATE flag isn't perfect. Now, one cannot remove ACLs stored in superblock (configured with tunefs(8)) via 'mount -a' nor 'mount -u -o noacls <file system>', but it is still possible to mount file system which doesn't have ACLs in superblock via 'mount -o acls <file system>' or /etc/fstab's 'acls' option. Reported by: Lech Lorens/pl.comp.os.bsd Discussed with: phk, rwatson Reviewed by: rwatson MFC after: 2 weeks
140220	14-Jan-2005	phk	Eliminate unused and unnecessary "cred" argument from vinvalbuf()
140181	13-Jan-2005	phk	Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT() directly.
140056	11-Jan-2005	phk	Add BO_SYNC() and add a default which uses the secret vnode pointer and VOP_FSYNC() for now.
140051	11-Jan-2005	phk	Wrap the bufobj operations in macros: BO_STRATEGY() and BO_WRITE()
140048	11-Jan-2005	phk	Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC(). I'm not sure why a credential was added to these in the first place, it is not used anywhere and it doesn't make much sense: The credentials for syncing a file (ability to write to the file) should be checked at the system call level. Credentials for syncing one or more filesystems ("none") should be checked at the system call level as well. If the filesystem implementation needs a particular credential to carry out the syncing it would logically have to the cached mount credential, or a credential cached along with any delayed write data. Discussed with: rwatson
139825	07-Jan-2005	imp	/* -> /*- for license, minor formatting changes
138869	14-Dec-2004	phk	white space
138868	14-Dec-2004	phk	Implement simpler panics for VOP_{read,write} on fifos.
138814	13-Dec-2004	imp	LINT defines things which compile in code that as referring to the old a_desc element. change this to the new a_gen.a_desc to reflect changes to vnode_if.h generation. Noticed by: tinderbox, phk
138744	12-Dec-2004	phk	With the introduction of UFS2 we started looking for superblocks in four different locations on a prospective filesystem. If we found none, we forgot to invalidate the four buffers, thus the following sequence would fails: (md0 = blank disk) mount /dev/md0 /mnt (fails, no superblocks) newfs /dev/md0 (writes using physio which does not go through buffercache). mount /dev/md0 /mnt (still fails, the four cached buffers still contain no superblocks) Found by: ru
138700	11-Dec-2004	marcel	Revert previous commit. The null-pointer function call (a dereference on ia64) was not the result of a change in the vector operations. It was caused by the NFS locking code using a FIFO and those bypassing the vnode. This indirectly caused the panic. The NFS locking code has been changed. Requested by: phk
138634	09-Dec-2004	mckusick	Fixes a bug that caused UFS2 filesystems bigger than 2TB to prematurely report that they were full and/or to panic the kernel with the message ``ffs_clusteralloc: allocated out of group''. Submitted by: Henry Whincup <henry@jot.to> MFC after: 1 week
138557	08-Dec-2004	phk	Fix snapshot creation.
138517	07-Dec-2004	phk	Fix nfs exports (for now). The real fix is to teach mountd about nmount.
138509	07-Dec-2004	phk	The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly split the conversion of the remaining three filesystems out from the root mounting changes, so in one go: cd9660: Convert to nmount. Add omount compat shims. Remove dedicated rootfs mounting code. Use vfs_mountedfrom() Rely on vfs_mount.c calling VFS_STATFS() nfs(client): Convert to nmount (the simple way, mount_nfs(8) is still necessary). Add omount compat shims. Drop COMPAT_PRELITE2 mount arg compatibility. ffs: Convert to nmount. Add omount compat shims. Remove dedicated rootfs mounting code. Use vfs_mountedfrom() Rely on vfs_mount.c calling VFS_STATFS() Remove vfs_omount() method, all filesystems are now converted. Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem task, and they all do it now. Change rootmounting to use DEVFS trampoline: vfs_mount.c: Mount devfs on /. Devfs needs no 'from' so this is clean. symlink /dev to /. This makes it possible to lookup /dev/foo. Mount "real" root filesystem on /. Surgically move the devfs mountpoint from under the real root filesystem onto /dev in the real root filesystem. Remove now unnecessary getdiskbyname(). kern_init.c: Don't do devfs mounting and rootvnode assignment here, it was already handled by vfs_mount.c. Remove now unused bdevvp(), addaliasu() and addalias(). Put the few necessary lines in devfs where they belong. This eliminates the second-last source of bogo vnodes, leaving only the lemming-syncer. Remove rootdev variable, it doesn't give meaning in a global context and was not trustworth anyway. Correct information is provided by statfs(/).
138412	05-Dec-2004	phk	VFS_STATFS(mp, ...) is mostly called with &mp->mnt_stat, but a few cases doesn't. Most of the implementations have grown weeds for this so they copy some fields from mnt_stat if the passed argument isn't that. Fix this the cleaner way: Always call the implementation on mnt_stat and copy that in toto to the VFS_STATFS argument if different.
138411	05-Dec-2004	marcel	Fix null-pointer indirect function calls introduced in the previous commit. In the new world order, the transitive closure on the vector operations is not precomputed. As such, it's unsafe to actually use any of the function pointers in an indirect function call. They can be null, and we need to use the default vector in that case. This is mostly a quick fix for the four function pointers that are ed explicitly. A more generic or scalable solution is likely to see the light of day. No pathos on: current@
138359	03-Dec-2004	phk	typo in comment.
138290	01-Dec-2004	phk	Back when VOP_* was introduced, we did not have new-style struct initializations but we did have lofty goals and big ideals. Adjust to more contemporary circumstances and gain type checking. Replace the entire vop_t frobbing thing with properly typed structures. The only casualty is that we can not add a new VOP_ method with a loadable module. History has not given us reason to belive this would ever be feasible in the the first place. Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc. Give coda correct prototypes and function definitions for all vop_()s. Generate a bit more data from the vnode_if.src file: a struct vop_vector and protype typedefs for all vop methods. Add a new vop_bypass() and make vop_default be a pointer to another struct vop_vector. Remove a lot of vfs_init since vop_vector is ready to use from the compiler. Cast various vop_mumble() to void * with uppercase name, for instance VOP_PANIC, VOP_NULL etc. Implement VCALL() by making vdesc_offset the offsetof() the relevant function pointer in vop_vector. This is disgusting but since the code is generated by a script comparatively safe. The alternative for nullfs etc. would be much worse. Fix up all vnode method vectors to remove casts so they become typesafe. (The bulk of this is generated by scripts)
138270	01-Dec-2004	phk	Mechanically change prototypes for vnode operations to use the new typedefs.
138075	25-Nov-2004	phk	Use system wide no-op vfs_start function.
137846	18-Nov-2004	jeff	- Eliminate the acquisition and release of the bqlock in bremfree() by setting the B_REMFREE flag in the buf. This is done to prevent lock order reversals with code that must call bremfree() with a local lock held. This also reduces overhead by removing two lock operations per buf for fsync() and similar. - Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock has been acquired so that we may remove ourself from the free-list. - Provide a bremfreef() function to immediately remove a buf from a free-list for use only by NFS. This is done because the nfsclient code overloads the b_freelist queue for its own async. io queue. - Simplify the numfreebuffers accounting by removing a switch statement that executed the same code in every possible case. - getnewbuf() can encounter locked bufs on free-lists once Giant is removed. Remove a panic associated with this condition and delay asserts that inspect the buf until after it is locked. Reviewed by: phk Sponsored by: Isilon Systems, Inc.
137726	15-Nov-2004	phk	Make VOP_BMAP return a struct bufobj for the underlying storage device instead of a vnode for it. The vnode_pager does not and should not have any interest in what the filesystem uses for backend. (vfs_cluster doesn't use the backing store argument.)
137657	13-Nov-2004	phk	Be prepared to accept NULL mountargs as part of root-mounting.
137608	12-Nov-2004	phk	Put back the vfs_object_create() calls, they do make a difference when my test-setup does what I want it to instead of what I ask it to. Pointed out by: tegge
137504	10-Nov-2004	phk	fix some comments
137491	09-Nov-2004	phk	Use mount flags instead of NULL path to detect root filesystem mount.
137486	09-Nov-2004	phk	Stop pretending to have a vm_object backing the underlying disk vnode: it isn't used for anything anywhere and the vnode_pager would explode if we attempted to.
137308	06-Nov-2004	phk	Properly implement a default version of VOP_GETWRITEMOUNT. Remove improper access to vop_stdgetwritemount() which should and will instead rely on the VOP default path.
137194	04-Nov-2004	phk	Don't grab the exclusive bit on a root filesystem until we are willing to mount it. Doing so prevented fsck to be run after a refused mount.
137035	29-Oct-2004	phk	Move UFS from DEVFS backing to GEOM backing. This eliminates a bunch of vnode overhead (approx 1-2 % speed improvement) and gives us more control over the access to the storage device. Access counts on the underlying device are not correctly tracked and therefore it is possible to read-only mount the same disk device multiple times: syv# mount -p /dev/md0 /var ufs rw 2 2 /dev/ad0 /mnt ufs ro 1 1 /dev/ad0 /mnt2 ufs ro 1 1 /dev/ad0 /mnt3 ufs ro 1 1 Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches things in RAM) this is not possible with read-write mounts, and the system will correctly reject this. Details: Add a geom consumer and a bufobj pointer to ufsmount. Eliminate the vnode argument from softdep_disk_prewrite(). Pick the vnode out of bp->b_vp for now. Eventually we should find it through bp->b_bufobj->b_private. In the mountcode, use g_vfs_open() once we have used VOP_ACCESS() to check permissions. When upgrading and downgrading between r/o and r/w do the right thing with GEOM access counts. Remove all the workarounds for not being able to do this with VOP_OPEN(). If we are the root mount, drop the exclusive access count until we upgrade to r/w. This allows fsck of the root filesystem and the MNT_RELOAD to work correctly. Set bo_private to the GEOM consumer on the device bufobj. Change the ffs_ops->strategy function to call g_vfs_strategy() In ufs_strategy() directly call the strategy on the disk bufobj. Same in rawread. In ffs_fsync() we will no longer see VCHR device nodes, so remove code which synced the filesystem mounted on it, in case we came there. I'm not sure this code made sense in the first place since we would have taken the specfs route on such a vnode. Redo the highly bogus readblock() function in the snapshot code to something slightly less bogus: Constructing an uio and using physio was really quite a detour. Instead just fill in a bio and ship it down.
137007	28-Oct-2004	phk	We only support backing UFS/FFS with disks.
136988	27-Oct-2004	phk	Eliminate unnecessary KASSERTS.
136982	26-Oct-2004	phk	KASSERT that we only get to prewrite() on writes.
136981	26-Oct-2004	phk	White space changes. Add missing static.
136980	26-Oct-2004	phk	Replace single case switch() with if().
136979	26-Oct-2004	phk	Vertically align comment.
136969	26-Oct-2004	phk	The island council met and voted buf_prewrite() home. Give ffs it's own bufobj->bo_ops vector and create a private strategy routine, (currently misnamed for forwards compatibility), which is just a copy of the generic bufstrategy routine except we call softdep_disk_prewrite() directly instead of through the buf_prewrite() indirection. Teach UFS about the need for softdep_disk_prewrite() and call the function directly in FFS. Remove buf_prewrite() from the default bufstrategy() and from the global bio_ops method vector.
136968	26-Oct-2004	phk	Fix syntax errors introduced by last commit. Why isn't DIRECTIO in NOTES/LINT ?
136966	26-Oct-2004	phk	Put the I/O block size in bufobj->bo_bsize. We keep si_bsize_phys around for now as that is the simplest way to pull the number out of disk device drivers in devfs_open(). The correct solution would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth when filesystems sit on GEOM, so don't bother for now.
136963	26-Oct-2004	phk	Degeneralize the per cdev copyonwrite callback. The only possible value is ffs_copyonwrite() and the only place it can be called from is FFS which would never want to call another filesystems copyonwrite method, should one exist, so there is no reason why anything generic should know about this.
136943	25-Oct-2004	phk	Loose the v_dirty* and v_clean* alias macros. Check the count field where we just want to know the full/empty state, rather than using TAILQ_EMPTY() or TAILQ_FIRST().
136941	25-Oct-2004	phk	Remove vnode->v_bsize. This was a dead-end.
136927	24-Oct-2004	phk	Move the buffer method vector (buf->b_op) to the bufobj. Extend it with a strategy method. Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY song and dance. Rename ibwrite to bufwrite(). Move the two NFS buf_ops to more sensible places, add bufstrategy to them. Add inlines for bwrite() and bstrategy() which calls through buf->b_bufobj->b_ops->b_{write,strategy}(). Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
136767	22-Oct-2004	phk	Add b_bufobj to struct buf which eventually will eliminate the need for b_vp. Initialize b_bufobj for all buffers. Make incore() and gbincore() take a bufobj instead of a vnode. Make inmem() local to vfs_bio.c Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj) also VI_MTX() to BO_MTX(), Make buf_vlist_add() take a bufobj instead of a vnode. Eliminate other uses of bp->b_vp where bp->b_bufobj will do. Various minor polishing: remove "register", turn panic into KASSERT, use new function declarations, TAILQ_FOREACH_SAFE() etc.
136751	21-Oct-2004	phk	Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write count on a bufobj. Bufobj_wdrop() replaces vwakeup(). Use these functions all relevant places except in ffs_softdep.c where the use if interlocked_sleep() makes this impossible. Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
136721	20-Oct-2004	rwatson	Explicitly break out NETA license from Berkeley license to clearly indicate license grant, as well as to indicate that NETA is asserting only two clauses, not four clauses. Requested by: imp
136336	09-Oct-2004	njl	Fix fsbtodb() for UFS1. This fixes an overflow for file sizes >1 TB, allowing for sizes up to 4 TB. This doesn't affect UFS2 since b is already a 64 bit type, coincidental with daddr_t. Submitted by: bde
136144	05-Oct-2004	pjd	Back out changes which were introduced to delay mounting root file system. Those changes were made on gmirror needs, but now gmirror handles this by itself.
135877	28-Sep-2004	phk	Remove support for accessing device nodes in UFS/FFS. Device nodes can still be created and exported with NFS.
135858	27-Sep-2004	phk	Give cluster_write() an explicit vnode argument. In the future a struct buf will not automatically point out a vnode for us.
135612	23-Sep-2004	pjd	Introduce new /boot/loader.conf variable: root_mount_delay. It can be used to delay mounting root partition to give a chance to GEOM providers to show up. Now, when there is no needed provider, vfs_rootmount() function will look for it every second and if it can't be find in defined time, it'll ask for root device name (before this change it was done immediately). This will allow to boot from gmirror device in degraded mode.
135459	19-Sep-2004	phk	The getpages VOP was a good stab at getting scatter/gather I/O without too much kernel copying, but it is not the right way to do it, and it is in the way for straightening out the buffer cache. The right way is to pass the VM page array down through the struct bio to the disk device driver and DMA directly in to/out off the physical memory. Once the VM/buf thing is sorted out it is next on the list. Retire most of vnode method. ffs_getpages(). It is not clear if what is left shouldn't be in the default implementation which we now fall back to. Retire specfs_getpages() as well, as it has no users now.
135312	16-Sep-2004	phk	Do not traverse list of snapshots if there isn't one. Found by: scottl
135303	16-Sep-2004	phk	Missed a place where snapshots were allocated in my last commit to this file.
135138	13-Sep-2004	phk	Create struct snapdata which contains the snapshot fields from cdev and the previously malloc'ed snapshot lock. Malloc struct snapdata instead of just the lock. Replace snapshot fields in cdev with pointer to snapdata (saves 16 bytes). While here, give the private readblock() function a vnode argument in preparation for moving UFS to access GEOM directly.
135135	13-Sep-2004	phk	Remove the buffercache/vnode side of BIO_DELETE processing in preparation for integration of p4::phk_bufwork. In the future, local filesystems will talk to GEOM directly and they will consequently be able to issue BIO_DELETE directly. Since the removal of the fla driver, BIO_DELETE has effectively been a no-op anyway.
134899	07-Sep-2004	phk	Create simple function init_va_filerev() for initializing a va_filerev field. Replace three instances of longhaired initialization va_filerev fields. Added XXX comment wondering why we don't use random bits instead of uptime of the system for this purpose.
134143	22-Aug-2004	csjp	Currently, if the secure level is low enough, system flags can be manipulated by prison root. In 4.x prison root can not manipulate system flags, regardless of the security level. This behavior should remain consistent to avoid any surprises which could lead to security problems for system administrators which give out privileged access to jails. This commit changes suser_cred's flag argument from SUSER_ALLOWJAIL to 0. This will prevent prison root from being able to manipulate system flags on files. This may be a MFC candidate for RELENG_5. Discussed with: cperciva Reviewed by: rwatson Approved by: bmilekic (mentor) PR: kern/70298
134011	19-Aug-2004	jhb	Generalize the UFS bad magic value used to determine when a filesystem has only been partly initialized via newfs(8) so that it applies to both UFS1 and UFS2. Submitted by: "Xin LI" delphij at frontfree dot net MFC: maybe?
133837	16-Aug-2004	dwmalone	When looking for some extra data to include in the hash, use the address of the dirhash, rather than the first sizeof(struct dirhash *) bytes of the structure (which, thankfully, seem to be constant). Submitted by: Ted Unangst <tedu@zeitbombe.org> MFC after: 2 weeks
133741	15-Aug-2004	jmg	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
133327	08-Aug-2004	phk	use bufdone() not biodone().
132902	30-Jul-2004	phk	Put a version element in the VFS filesystem configuration structure and refuse initializing filesystems with a wrong version. This will aid maintenance activites on the 5-stable branch. s/vfs_mount/vfs_omount/ s/vfs_nmount/vfs_mount/ Name our filesystems mount function consistently. Eliminate the namiedata argument to both vfs_mount and vfs_omount. It was originally there to save stack space. A few places abused it to get hold of some credentials to pass around. Effectively it is unused. Reorganize the root filesystem selection code.
132805	28-Jul-2004	phk	Remove global variable rootdevs and rootvp, they are unused as such. Add local rootvp variables as needed. Remove checks for miniroot's in the swappartition. We never did that and most of the filesystems could never be used for that, but it had still been copy&pasted all over the place.
132775	28-Jul-2004	kan	Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper inode field based on UFS version. Use DIP ro read values and DIP_SET to modify them throughout FFS code base.
132653	26-Jul-2004	cperciva	Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb
132154	14-Jul-2004	phk	Make sure to update the mnt_stats before UFS1 extattr tried to do I/O on the device. Otherwise the blocksize is undefined in the buffer cache.
132023	12-Jul-2004	alfred	Make VFS_ROOT() and vflush() take a thread argument. This is to allow filesystems to decide based on the passed thread which vnode to return. Several filesystems used curthread, they now use the passed thread.
131907	10-Jul-2004	marcel	Update for the KDB debugger framework: o Make debugging code conditional upon KDB. o Use kdb_backtrace() instead of backtrace(). o Remove inclusion of opt_ddb.h.
131756	07-Jul-2004	phk	Explicity initialize vp->v_bsize.
131551	04-Jul-2004	phk	When we traverse the vnodes on a mountpoint we need to look out for our cached 'next vnode' being removed from this mountpoint. If we find that it was recycled, we restart our traversal from the start of the list. Code to do that is in all local disk filesystems (and a few other places) and looks roughly like this: MNT_ILOCK(mp); loop: for (vp = TAILQ_FIRST(&mp...); (vp = nvp) != NULL; nvp = TAILQ_NEXT(vp,...)) { if (vp->v_mount != mp) goto loop; MNT_IUNLOCK(mp); ... MNT_ILOCK(mp); } MNT_IUNLOCK(mp); The code which takes vnodes off a mountpoint looks like this: MNT_ILOCK(vp->v_mount); ... TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes); ... MNT_IUNLOCK(vp->v_mount); ... vp->v_mount = something; (Take a moment and try to spot the locking error before you read on.) On a SMP system, one CPU could have removed nvp from our mountlist but not yet gotten to assign a new value to vp->v_mount while another CPU simultaneously get to the top of the traversal loop where it finds that (vp->v_mount != mp) is not true despite the fact that the vnode has indeed been removed from our mountpoint. Fix: Introduce the macro MNT_VNODE_FOREACH() to traverse the list of vnodes on a mountpoint while taking into account that vnodes may be removed from the list as we go. This saves approx 65 lines of duplicated code. Split the insmntque() which potentially moves a vnode from one mount point to another into delmntque() and insmntque() which does just what the names say. Fix delmntque() to set vp->v_mount to NULL while holding the mountpoint lock.
131072	24-Jun-2004	rwatson	Annotate that we don't check the returned data length from ufs_readdir() because UFS uses fixed-size directory blocks. When using this code with other file systems, such as HFS+, the value of auio.uio_resid will need to be taken into account.
131069	24-Jun-2004	rwatson	Remove unnecessary setting of VV_SYSTEM on extended attribute backing files. When this flag is used in our port of this code to Darwin, it caused remarkable pain, and doesn't offer a benefit in FreeBSD.
131067	24-Jun-2004	rwatson	Protect a non-text comment with a '-'.
131066	24-Jun-2004	rwatson	White space cleanup: use spaces instead of tabs in variable declarations local to a function. Remove a couple of blank lines in variable declarations. In one case, explicitly test against NULL rather than using a pointer as a boolean directly.
130761	20-Jun-2004	bde	Backed out previous commit. The dev_t -> `struct cdev ' changes have lots of errors. Blind substitution of "dev_t foo" by "struct cdev foo" in comments usually just created an English syntax error (e.g., "struct cdev *changes"), but here it did less than that since the dev_t is a user dev_t.
130690	18-Jun-2004	kuriyama	Avoid deadlock which is caused by locking VDIR of parent and VREG of snapshot itself in wrong order. We can skip unlink check of that directory because it must have snapshot in it. Reviewed by: mckusick and current@
130585	16-Jun-2004	phk	Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.
130551	16-Jun-2004	julian	Nice, is a property of a process as a whole.. I mistakenly moved it to the ksegroup when breaking up the process structure. Put it back in the proc structure.
130246	08-Jun-2004	stefanf	Avoid assignments to cast expressions. Reviewed by: md5 Approved by: das (mentor)
130023	03-Jun-2004	tjr	Move TDF_DEADLKTREAT into td_pflags (and rename it accordingly) to avoid having to acquire sched_lock when manipulating it in lockmgr(), uiomove(), and uiomove_fromphys(). Reviewed by: jhb
129895	31-May-2004	krion	- Fix typo Approved by: tobez
129545	21-May-2004	kensmith	Upon further review it was decided this piece of the msync(2) fixes was applicable to HEAD, originally it was thought this should only be done in RELENG_4. Implement IO_INVAL in the vnode op for writing by marking the buffer as "no cache". This fix has already been applied to RELENG_4 as Rev. 1.65.2.15 of ufs/ufs/ufs_readwrite.c. Reviewed by: alc, tegge
129450	19-May-2004	kensmith	Style fixup in previous commit. Noticed by: bde (thanks!)
129244	14-May-2004	kensmith	Change ffs_realloccg() to set the valid bits for the extended part of the fragment to zero the valid parts of a VM_IO buffer. RE would like this to be part of 4.10-RC3 so this will be MFC-ed immediately. Reviewed by: alc, tegge
128740	29-Apr-2004	bmilekic	Revert previous change to this file because it breaks some things which compare /etc/fstab entries to results from getfsstat(). The real way to fix this is to make 'ufs2' a recognized filesystem (for real, no beating around the bush). This should fix things like 'umount -a -t ufs' now. Appologies for the previous breakage.
128658	26-Apr-2004	bmilekic	The previous change to mount(8) to report ufs or ufs2 used libufs, which only works for Charlie root. This change reverts the introduction of libufs and moves the check into the kernel. Since the f_fstypename is the same for both ufs and ufs2, we check fs_magic for presence of ufs2 and copy "ufs2" explicitly instead. Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
128006	07-Apr-2004	bde	Record where half the bits in this file came from (from ufs_readwrite.c). Damage to history from moving bits was especially large since a repo copy is not feasible for partial files.
127975	07-Apr-2004	imp	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and irc message from Robert Watson saying that clause 3 can be removed from those files with an NAI copyright that also have only a University of California copyrights. Approved by: core, rwatson
127955	06-Apr-2004	jhb	Fix a paste-o from the buf_prewrite() cleanup commit and check for the MNTK_SUSPEND flag on the correct vnode pointer in softdep_disk_prewrite(). Reviewed by: phk Tested by: kensmith
127818	03-Apr-2004	mux	Fix the remaining warnings of growfs(8) on my sparc64 box with WARNS=6. I don't change the WARNS level in the Makefile because I didn't tested this on other archs. The fs.h fix was suggested by: marcel Reviewed by: md5(1)
127095	16-Mar-2004	kan	Avoid doing bawrite to initialize inode block while holding cylinder group block locked. If filesystem has any active snapshots, bawrite can come back trying to allocate new snapshot data block from the same cylinder group and cause panic due to recursive lock attempt. PR: 64206 Reviewed by: mckusick Tested by: pjd
126858	11-Mar-2004	phk	When I was a kid my work table was one cluttered mess an cleaning it up were a rather overwhelming task. I soon learned that if you don't know where you're going to store something, at least try to pile it next to something slightly related in the hope that a pattern emerges. Apply the same principle to the ffs/snapshot/softupdates code which have leaked into specfs: Add yet a buf-quasi-method and call it from the only two places I can see it can make a difference and implement the magic in ffs_softdep.c where it belongs. It's not pretty, but at least it's one less layer violated.
126853	11-Mar-2004	phk	Properly vector all bwrite() and BUF_WRITE() calls through the same path and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().
126170	23-Feb-2004	mckusick	A more accurate test in the new ufs_lock than that in 1.235.
126154	23-Feb-2004	mckusick	In the function clear_inodedeps(), a FREE_LOCK() should be called AFTER the call to vn_start_write(), not before it. Otherwise, it is possible to unlock it multiple times if the vn_start_write() fails. Submitted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
126153	23-Feb-2004	mckusick	Change UFS from using vop_stdlock to using its own ufs_lock. In ufs_lock, check for attempts to acquire shared locks on snapshot files and change them to be exclusive locks. This change eliminates deadlocks and machine lockups reported in -current since most read requests started using shared lock requests. Submitted by: Jun Kuriyama <kuriyama@imgsrc.co.jp>
126097	22-Feb-2004	rwatson	Update my personal copyrights and NETA copyrights in the kernel to use the "year1-year3" format, as opposed to "year1, year2, year3". This seems to make lawyers more happy, but also prevents the lines from getting excessively long as the years start to add up. Suggested by: imp
125854	15-Feb-2004	dwmalone	Abstract dirhash's locking using macros. This should make it easier to use the same dirhash code on different branches/platforms. Reviewed by: Ted Unangst <tedu@zeitbombe.org> Reviewed by: iedowse MFC after: 3 weeks
125796	14-Feb-2004	bde	Fixed some style bugs: - don't unlock the vnode after vinvalbuf() only to have to relock it almost immediately. - don't refer to devices classified by vn_isdisk() as block devices.
125765	13-Feb-2004	bde	MFextfs: backed out secondary changes in rev.1.40 that had become just style bugs (a variable that is used only once, and misformattings).
125764	13-Feb-2004	kuriyama	Fix style bugs in previous commit. Submitted by: bde
125738	12-Feb-2004	bde	Fixed some minor style bugs (English usage and formatting of binary operators) in and near revs.1.169-1.170 (open mode bandaid). This (or better a proper fix) should have been done before cloning the bandaid to many other file systems.
125732	12-Feb-2004	kuriyama	Reverse lock order by using local variable. This will shut up "acquiring duplicate lock of same type" message. Reviewed by: mckusick
125710	11-Feb-2004	bde	Removed more vestiges of vfs_ioopt: - rev.1.42 of ffs_readwrite.c added a special case in ffs_read() for reads that are initially at EOF, and rev.1.62 of ufs_readwrite.c fixed timestamp bugs in it. Removal of most of vfs_ioopt made it just and optimization, and removal of the vm object reference calls made it less than an optimization. It was cloned in rev.1.94 of ufs_readwrite.c as part of cloning ffs_extwrite() although it was always less than an optimization in ffs_extwrite(). - some comments, compound statements and vertical whitespace were vestiges of dead code.
125454	04-Feb-2004	jhb	Locking for the per-process resource limits structure. - struct plimit includes a mutex to protect a reference count. The plimit structure is treated similarly to struct ucred in that is is always copy on write, so having a reference to a structure is sufficient to read from it without needing a further lock. - The proc lock protects the p_limit pointer and must be held while reading limits from a process to keep the limit structure from changing out from under you while reading from it. - Various global limits that are ints are not protected by a lock since int writes are atomic on all the archs we support and thus a lock wouldn't buy us anything. - All accesses to individual resource limits from a process are abstracted behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return either an rlimit, or the current or max individual limit of the specified resource from a process. - dosetrlimit() was renamed to kern_setrlimit() to match existing style of other similar syscall helper functions. - The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit() (it didn't used the stackgap when it should have) but uses lim_rlimit() and kern_setrlimit() instead. - The svr4 compat no longer uses the stackgap for resource limits calls, but uses lim_rlimit() and kern_setrlimit() instead. - The ibcs2 compat no longer uses the stackgap for resource limits. It also no longer uses the stackgap for accessing sysctl's for the ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result, ibcs2_sysconf() no longer needs Giant. - The p_rlimit macro no longer exists. Submitted by: mtm (mostly, I only did a few cleanups and catchups) Tested on: i386 Compiled on: alpha, amd64
125259	31-Jan-2004	alc	Remove unnecessary vm object reference and deallocate calls from ffs_read() and ffs_write(). These calls trace their origins to the dead vfs_ioopt code, first appearing in revision 1.39 of ufs_readwrite.c. Observed by: bde Discussed with: tegge
125079	27-Jan-2004	ache	Turn uio_resid/uio_offset comments into KASSERTs Reviewed by: bde
124857	23-Jan-2004	ache	Copy comment about caller check from ffs_read to ffs_extread, don't check for uio_resid < 0 here too.
124856	23-Jan-2004	ache	Fix various panic() strings to reflect true function name to allow easy grep. Small code reorganization to look more logic. Copy ffs_write check from prev. commit to ffs_extwrite.
124855	23-Jan-2004	ache	ffs_read: Replace wrong check returned EFBIG with EOVERFLOW handling from POSIX: 36708 [EOVERFLOW] The file is a regular file, nbyte is greater than 0, the starting position is before the end-of-file, and the starting position is greater than or equal to the offset maximum established in the open file description associated with fildes. ffs_write: Replace u_int64_t cast with uoff_t cast which is more natural for types used. ffs_write & ffs_read: Remove uio_offset and uio_resid checks for negative values, the caller supposed to do it already. Add comments about it. Reviewed by: bde
124728	19-Jan-2004	kan	Spell magic '16' number as IO_SEQSHIFT.
124119	04-Jan-2004	kan	Avoid calling vprint on a vnode while holding its interlock mutex. Move diagnostic printf after vget. This might delay the debug output some, but at least it keeps kernel from exploding if DEBUG_VFS_LOCKS is in effect.
123217	07-Dec-2003	truckman	Set fs_ronly to the correct value in ffs_reload() when reloading the file system super block after fsck has repaired the file system. The value of fs_ronly was getting overwritten, which caused ffs_update() to attempt to update inode timestamps even though the file system was still mounted read-only. This fixes the "giving up on N buffers" error that is triggered by running fsck on the root file system and then rebooting without mounting the file system read-write.
122783	16-Nov-2003	wes	Write the UFS2 superblock with a 'BAD' magic number at the beginning of newfs, to signify the newfs operation has not yet completed. Re- write the superblock with the correct magic number once all of the cylinder groups have been created to show the operation has finished. Sponsored by: St. Bernard Software
122747	15-Nov-2003	phk	Send B_PHYS out to pasture, it no longer serves any function.
122596	13-Nov-2003	alc	Call free(9) after the vnode interlock is released, avoiding a lock-order reversal.
122537	12-Nov-2003	mckusick	Update the statfs structure with 64-bit fields to allow accurate reporting of multi-terabyte filesystem sizes. You should build and boot a new kernel BEFORE doing a `make world' as the new kernel will know about binaries using the old statfs structure, but an old kernel will not know about the new system calls that support the new statfs structure. Running an old kernel after a `make world' will cause programs such as `df' that do a statfs system call to fail with a bad system call. Reviewed by: Bruce Evans <bde@zeta.org.au> Reviewed by: Tim Robbins <tjr@freebsd.org> Reviewed by: Julian Elischer <julian@elischer.org> Reviewed by: the hoards of <arch@freebsd.org> Sponsored by: DARPA & NAI Labs.
122091	05-Nov-2003	kan	Remove mntvnode_mtx and replace it with per-mountpoint mutex. Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to operate on this mutex transparently. Eventually new mutex will be protecting more fields in struct mount, not only vnode list. Discussed with: jeff
121925	03-Nov-2003	kan	Use VOP_UNLOCK/vrele instead of vput. td was erecived as a parameter and one cannot be sure it is equal to curthread.
121874	02-Nov-2003	kan	Take care not to call vput if thread used in corresponding vget wasn't curthread, i.e. when we receive a thread pointer to use as a function argument. Use VOP_UNLOCK/vrele in these cases. The only case there td != curthread known at the moment is boot() calling sync with thread0 pointer. This fixes the panic on shutdown people have reported.
121847	01-Nov-2003	kan	Temporarily undo parts of the stuct mount locking commit by jeff. It is unsafe to hold a mutex across vput/vrele calls. This will be redone when a better locking strategy is agreed upon. Discussed with: jeff
121785	31-Oct-2003	truckman	Tweak the calculation of minbfree in ffs_dirpref() so that only those cylinder groups that have at least 75% of the average free space per cylinder group for that file system are considered as candidates for the creation of a new directory. The previous formula for minbfree would set it to zero if the file system was more than 75% full, which allowed cylinder groups with no free space at all to be chosen as candidates for directory creation, which resulted in an expensive search for free blocks for each file that was subsequently created in that directory. Modify the calculation of minifree in the same way. Decrease maxcontigdirs as the file system fills to decrease the likelyhood that a cluster of directories will overflow the available space in a cylinder group. Reviewed by: mckusick Tested by: kmarx@vicor.com MFC after: 2 weeks
121443	23-Oct-2003	jhb	Move the P_COWINPROGRESS flag from being a per-process p_flag to being a per-thread td_pflag which doesn't require any locks to read or write as it is only read or written by curthread on itself. Glanced at by: mckusick
121354	22-Oct-2003	tegge	Initialize bp->b_offset to the physical offset in partition so GEOM knows where to read from disk.
121205	18-Oct-2003	phk	DuH! bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in the file)
121202	18-Oct-2003	phk	Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY()
121158	17-Oct-2003	mckusick	When expunging unlinked files from a snapshot, skip over holes in the file rather than panicing with "indiracct: botched params". Submitted by: Mark Santcroos <marks@ripe.net>
120841	06-Oct-2003	jeff	- My last commit to this file is still not safe, I believe that it may be due to the recursion in indir_trunc().
120839	06-Oct-2003	jeff	- Reinstate 1.142 this was fixed by 1.144.
120825	05-Oct-2003	jeff	- The VCHR case in ffs_sync() is an unneccsary optimization especially considering how infrequently we access devices via ffs now that we have devfs. Collapse this case with the other case. Obtained from: bde
120805	05-Oct-2003	jeff	- Further simplify ffs_sync(). The vnode lock is required for UFS_UPDATE() so make the code slightly more uniform. The vnode lock is acquired in all cases and now the only difference between VCHR and other is we call UFS_UPDATE instead of VOP_FSYNC().
120804	05-Oct-2003	jeff	- In ffs_update() assert that either the vnode lock or the XLOCK is held.
120793	05-Oct-2003	jeff	- Check the XLOCK before inspecting v_data. - Slightly rewrite the fsync loop to be more lock friendly. We must acquire the vnode interlock before dropping the mnt lock. We must also check XLOCK to prevent vclean() races. - Use LK_INTERLOCK in the vget() in ffs_sync to further prevent vclean() races. - Use a local variable to store the results of the nvp == TAILQ_NEXT test so that we do not access the vp after we've vrele()d it. - Add an XXX comment about UFS_UPDATE() not being protected by any lock here. I suspect that it should need the VOP lock.
120789	05-Oct-2003	jeff	- Skip over xvp if XLOCK is set.
120777	05-Oct-2003	jeff	- Don't cache_purge() in ufs_reclaim. vclean() does it for us so this is redundant.
120763	04-Oct-2003	alc	Synchronize access to a vm page's valid field using the containing vm object's lock.
120750	04-Oct-2003	jeff	- The VI assert in getdirtybuf() is only valid if we're not on a VCHR vnode. VCHR vnodes don't do background writes. Reported by: kan
120741	04-Oct-2003	jeff	- Increase the scope of the interlock in ffs_reload(). Acquire it before we release the mntvnode_mtx. - Call vgonel() directly instead of going through vrecycle() since we own the interlock now. - Remove a few cases where we locked the interlock just so that we could call VOP_UNLOCK with interlock held.
120740	04-Oct-2003	jeff	- Fix an unlocked call to GETATTR by slightly shuffling the code in ffs_snapshot() around. - Acquire the interlock before releasing the mntvnode_mtx. Use the interlock to protect v_usecount access.
120738	04-Oct-2003	jeff	- Use the VI_LOCK macro in two places where we directly called mtx_lock() before. Direct calls indicated places that needed review and these have now been reviewed.
120737	04-Oct-2003	jeff	- Properly acquire the vnode interlock before releasing the mntvnode_mtx. - Use a local variable to store the results of the test to see if the next vnode on the mount list has changed. This is so that we no longer acess the vnode after we vput() it.
120732	04-Oct-2003	jeff	- Remove a mp_fixme() and some locks that weren't necessary. I now understand how this works.
119707	03-Sep-2003	jeff	- Several of the callers to getdirtybuf() were erroneously changed to pass in a list head instead of a pointer to the first element at the time of the first call. These lists are subject to change, and getdirtybuf() would refetch from the wrong list in some cases. Spottedy by: tegge Pointy hat to: me
119604	31-Aug-2003	jeff	- Backout rev 1.142. This caused a deadlock that I do not understand. More investigation is required.
119603	31-Aug-2003	jeff	- Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to bail out if the buffer is not already present. - The buffer returned by incore() is not locked and should not be sent to brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the desired semantics.
119601	31-Aug-2003	jeff	- Don't acquire the vnode interlock in drain_output(). Instead, require the caller to acquire it. This permits drain_output() to be done atomically with other operations as well as reducing the number of lock operations. - Assert that the proper locks are held in drain_output(). - Change getdirtybuf() to accept a mutex as an argument. This mutex is used to protect the vnode's buf list and the BKGRDWAIT flag. This lock is dropped when we successfully acquire a buffer and held on return otherwise. These semantics reduce the number of cumbersome cases in calling code. - Pass the mtx from getdirtybuf() into interlocked_sleep() and allow this mutex to be used as the interlock argument to BUF_LOCK() in the LOCKBUF case of interlocked_sleep(). - Change the return value of getdirtybuf() to be the resulting locked buffer or NULL otherwise. This is for callers who pass in a list head that requires a lock. It is necessary since the lock that protects the list head must be dropped in getdirtybuf() so that we don't have a lock order reversal with the buf queues lock in bremfree(). - Adjust all callers of getdirtybuf() to match the new semantics. - Add a comment in indir_trunc() that points at unlocked access to a buf. This may also be one of the last instances of incore() in the tree.
119521	28-Aug-2003	jeff	- Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field. - Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode interlock. - Don't use the B_LOCKED flag and QUEUE_LOCKED for background write buffers. Check for the BKGRDINPROG flag before recycling or throwing away a buffer. We do this instead because it is not safe for us to move the original buffer to a new queue from the callback on the background write buffer. - Remove the B_LOCKED flag and the locked buffer queue. They are no longer used. - The vnode interlock is used around checks for BKGRDINPROG where it may not be strictly necessary. If we hold the buf lock the a back-ground write will not be started without our knowledge, one may only be completed while we're not looking. Rather than remove the code, Document two of the places where this extra locking is done. A pass should be done to verify and minimize the locking later.
119088	18-Aug-2003	alc	The previous change necessitates the addition of a new #include. Otherwise, there is a compilation warning.
119049	17-Aug-2003	phk	Don't use a VOP_*() function on our own vnodes, go directly to the relevant internal function, in this case ufs_bmaparray().
118986	16-Aug-2003	alc	Revision 1.44 of ufs/ufs/inode.h has made it necessary to add two new #includes to this file. Otherwise, it doesn't compile.
118969	15-Aug-2003	phk	Eliminate the i_devvp field from the incore UFS inodes, we can get the same value from ip->i_ump->um_devvp. This saves a pointer in the memory copies of inodes, which can easily run into several hundred kilobytes. The extra indirection is unmeasurable in benchmarks. Approved by: mckusick
118607	07-Aug-2003	jhb	Consistently use the BSD u_int and u_short instead of the SYSV uint and ushort. In most of these files, there was a mixture of both styles and this change just makes them self-consistent. Requested by: bde (kern_ktrace.c)
118411	04-Aug-2003	rwatson	Now that the central POSIX.1e ACL code implements functions to generate the inode mode from a default ACL and creation mask, implement ufs_sync_inode_from_acl() using acl_posix1e_newfilemode(). Since ACL_OVERRIDE_MASK/ACL_PRESERVE_MASK are defined, we no longer need to explicitly pass in a "preserve_mask" field: this is implicit in the use of POSIX.1e semantics. Note: this change contains a semantic bugfix for new file creation: we now intersect the ACL-generated mode and the cmode requested by the user process. This means permissions on newly created file objects will now be more conservative. In the future, we may want to provide alternative semantics (similar to Solaris and Linux) in which the ACL mask overrides the umask, permitting ACLs to broaden the rights beyond the requested umask. PR: 50148 Reported by: Ritz, Bruno <bruno_ritz@gmx.ch> Obtained from: TrustedBSD Project
118404	04-Aug-2003	rwatson	In ufs_chmod(), use privilege only when required in the following cases: - Setting sticky bit on non-directory - Setting setgid on a file with a group that isn't in the effective or extended groups of the authorizing credential I.e., test the requirement first, then do the privilege test, rather than doing the privilege test regardless of the need for privilege. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
118131	28-Jul-2003	rwatson	Rename VOP_RMEXTATTR() to VOP_DELETEEXTATTR() for consistency with the kernel ACL interfaces and system call names. Break out UFS2 and FFS extattr delete and list vnode operations from setextattr and getextattr to deleteextattr and listextattr, which cleans up the implementations, and makes the results more readable, and makes the APIs more clear. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
118094	27-Jul-2003	phk	Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.
118047	26-Jul-2003	phk	Add a "int fd" argument to VOP_OPEN() which in the future will contain the filedescriptor number on opens from userland. The index is used rather than a "struct file " since it conveys a bit more information, which may be useful to in particular fdescfs and /dev/fd/ For now pass -1 all over the place.
117221	04-Jul-2003	phk	We just cached the inode pointer, no need to call VTOI() again.
116423	15-Jun-2003	alc	Lock the vm object when freeing pages.
116412	15-Jun-2003	phk	Add the same KASSERT to all VOP_STRATEGY and VOP_SPECSTRATEGY implementations to check that the buffer points to the correct vnode.
116384	15-Jun-2003	rwatson	Re-implement kernel access control for quotactl() as found in the UFS quota implementation. Push some quite broken access control logic out of ufs_quotactl() into the individual command implementations in ufs_quota.c; fix that logic. Pass in the thread argument to any quotactl command that will need to perform access control. o quotaon() requires privilege (PRISON_ROOT). o quotaoff() requires privilege (PRISON_ROOT). o getquota() requires that: If the type is USRQUOTA, either the effective uid match the requested quota ID, that the unprivileged_get_quota flag be set, or that the thread be privileged (PRISON_ROOT). If the type is GRPQUOTA, require that either the thread be a member of the group represented by the requested quota ID, that the unprivileged_get_quota flag be set, or that the thread be privileged (PRISON_ROOT). o setquota() requires privilege (PRISON_ROOT). o setuse() requires privilege (PRISON_ROOT). o qsync() requires no special privilege (consistent with what was present before, but probably not very useful). Add a new sysctl, security.bsd.unprivileged_get_quota, which when set to a non-zero value, will permit unprivileged users to query user quotas with non-matching uids and gids. Set this to 0 by default to be mostly consistent with the previous behavior (the same for USRQUOTA, but not for GRPQUOTA). Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
116271	12-Jun-2003	phk	Initialize struct vfsops C99-sparsely. Submitted by: hmp Reviewed by: phk
116192	11-Jun-2003	obrien	Use __FBSDID().
115869	05-Jun-2003	rwatson	Implement ffs_listextattr() by breaking out that logic and special-cased attribute name of "" from ffs_getextattr(). Invoking VOP_GETETATTR() with an empty name is now no longer supported; user application compatibility is provided by a system call level compatibility wrapper. We make sure to explicitly reject attempts to set an EA with the name "". Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
115865	05-Jun-2003	rwatson	Don't special-case handling of the empty string in the UFS1 extended attribute retrieval code: it's no longer special-cased, and is caught by the normal UFS1 EA validity checks (and, in fact, returns the same error, EINVAL). Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
115588	01-Jun-2003	rwatson	Return EOPNOTSUPP for attempted EA operations on VCHR vnodes in UFS2; if we permit them to occur, the kernel panics due to our performing EA operations using VOP_STRATEGY on the vnode. This went unnoticed previously because there are very for users of device nodes on UFS2 due to the introduction of devfs. However, this can come up with the Linux compat directories and its hard-coded dev nodes (which will need to go away as we move away from hard-coded device numbers). This can come up if you use EA-intensive features such as ACLs and MAC. The proper fix is pretty complicated, but this band-aid would be an excellent MFC candidate for the release.
115526	31-May-2003	phk	Remove unused variable. Found by: FlexeLint
115474	31-May-2003	phk	Remove unused local variables. Found by: FlexeLint
115456	31-May-2003	phk	The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent deadlocks with vnode backed md(4) devices because md now uses a kthread to run the bio requests instead of doing it directly from the bio down path.
115145	18-May-2003	alc	Lock the vm object when performing vm_object_page_clean(). Approved by: re (rwatson)
115040	15-May-2003	rwatson	Jeff added locking assertions that the VV_ flags on vnodes were modified only while holding appropriate vnode locks. This patch slides the lock release for ufs_extattr_enable() to continue to hold the active vnode lock on a backing file until after the flag change; it also acquires a vnode lock when disabling an attribute and hence clearing a flag on the backing vnode. This permits VFS_DEBUG_LOCKS to run UFS1 extended attributes without panicking, as well as preventing a potential race and vnode flag problem. Approved by: re (jhb) Pointed out by: DEBUG_VFS_LOCKS
114599	03-May-2003	alc	Lock the vm_object on entry to vm_object_vndeallocate().
114396	01-May-2003	tjr	Do not attempt to free NULL dinodes (i_din1 or i_din2) in ffs_ifree(). These fields can be left as NULL if ffs_vget() allocates an inode but fails before the dinode memory has been allocated. There are two cases when this can occur: when we lose a race and another process has added the inode to the hash, and when reading the inode off disk fails. The bug was observed by Kris on one of the package-building machines. See http://marc.theaimsgroup.com/?l=freebsd-current&m=105172731013411&w=2 In Kris's case, it was the bread() that failed because of a disk error. The alternative to this patch is to ensure that ffs_vget() does not call vput() when the inode that hasn't been properly initialised.
114395	01-May-2003	tjr	Free i_din2 instead of i_din1 in ffs_ifree() on UFS2 filesystems. This is purely a cosmetic change because these members are in a union together.
114293	30-Apr-2003	markm	Fix some easy, global, lint warnings. In most cases, this means making some local variables static. In a couple of cases, this means removing an unused variable.
114216	29-Apr-2003	kan	Deprecate machine/limits.h in favor of new sys/limits.h. Change all in-tree consumers to include <sys/limits.h> Discussed on: standards@ Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
113872	22-Apr-2003	jhb	Lock both the proc lock and sched_lock when calling sched_nice since kg_nice is now protected by both. Being protected by both means that other places in the kernel that want to read kg_nice only need one of the two locks.
113376	12-Apr-2003	jeff	- Use the sched_nice() api instead of setting the nice value directly. Tested by: Steve Kargl <sgk@troutmask.apl.washington.edu>
113175	06-Apr-2003	alc	Sufficient access checks are performed by vmapbuf() that calling useracc() is pointless. Remove the call to useracc(). Don't reinitialize fields that are already initialized by getpbuf(). Reviewed by: tegge
112724	27-Mar-2003	tegge	Check return value from vmapbuf instead of the function address.
112718	27-Mar-2003	tegge	Eliminate a buffer sleep/wakeup race.
112694	26-Mar-2003	tegge	Add support for reading directly from file to userland buffer when the O_DIRECT descriptor status flag is set and both offset and length is a multiple of the physical media sector size.
112451	20-Mar-2003	jhb	Use td->td_ucred instead of td->td_proc->p_ucred.
112450	20-Mar-2003	jhb	Minor fixes to ffs_fserr(): - Assume that curthread is not NULL. It never is in -current. - Use td_ucred instead of p_ucred.
112367	18-Mar-2003	phk	Including <sys/stdint.h> is (almost?) universally only to be able to use %j in printfs, so put a newsted include in <sys/systm.h> where the printf prototype lives and save everybody else the trouble.
112181	13-Mar-2003	jeff	- Remove a race between fsync like functions and flushbufqueues() by requiring locked bufs in vfs_bio_awrite(). Previously the buf could have been written out by fsync before we acquired the buf lock if it weren't for giant. The cluster_wbuild() handles this race properly but the single write at the end of vfs_bio_awrite() would not. - Modify flushbufqueues() so there is only one copy of the loop. Pass a parameter in that says whether or not we should sync bufs with deps. - Call flushbufqueues() a second time and then break if we couldn't find any bufs without deps.
111972	07-Mar-2003	mckusick	Use the appropriate size when zeroing out the unused portion of a snapshot's copy of a superblock. This patch fixes a panic when taking a snapshot of a 4096/512 filesystem. Reported by: Ian Freislich <ianf@za.uu.net> Sponsored by: DARPA & NAI Labs.
111937	06-Mar-2003	alc	Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress. Discussed on: arch@
111856	04-Mar-2003	jeff	- Add a new 'flags' parameter to getblk(). - Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT flag to the initial BUF_LOCK(). This will eventually be used in cases were we want to use a buffer only if it is not currently in use. - Convert all consumers of the getblk() api to use this extra parameter. Reviwed by: arch Not objected to by: mckusick
111841	03-Mar-2003	njl	Finish cleanup of vprint() which was begun with changing v_tag to a string. Remove extraneous uses of vop_null, instead defering to the default op. Rename vnode type "vfs" to the more descriptive "syncer". Fix formatting for various filesystems that use vop_print.
111748	02-Mar-2003	des	More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).
111510	25-Feb-2003	mckusick	Change the field used to test whether the superblock has been updated from the filesystem size field to the filesystem maximum blocksize field. The problem is that older versions of growfs updated only the new size field and not the old size field. This resulted in the old (smaller) size field being copied up to the new size field which caused the filesystem to appear to fsck to be badly trashed. This also adds a sanity check to ensure that the superblock is not being updated when the filesystem is mounted read-only. Obviously such an update should never happen. Reported by: Nate Lawson <nate@root.org> Sponsored by: DARPA & NAI Labs.
111463	25-Feb-2003	jeff	- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK. - Remove the buftimelock mutex and acquire the buf's interlock to protect these fields instead. - Hold the vnode interlock while locking bufs on the clean/dirty queues. This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another BUF_LOCK with a LK_TIMEFAIL to a single lock. Reviewed by: arch, mckusick
111423	24-Feb-2003	das	Expand the reference count on struct dquot to 32 bits. This fixes a panic on large systems where a single user may have more than 64K active or inactive vnodes. PR: 48234 Reviewed by: mike (mentor)
111420	24-Feb-2003	mckusick	When removing the last item from a non-empty worklist, the worklist tail pointer must be updated. Reported by: Kris Kennaway <kris@obsecurity.org> Sponsored by: DARPA & NAI Labs.
111240	22-Feb-2003	mckusick	This patch fixes a deadlock between the bufdaemon and a process taking a snapshot. As part of taking a snapshot of a filesystem, the kernel builds up a list of the filesystem metadata (such as the cylinder group bitmaps) that are contained in the snapshot. When doing a copy-on-write check, the list is first consulted. If the block being written is found on the list, then the full snapshot lookup can be avoided. Besides providing an important performance speedup this check also avoids a potential deadlock between the code creating the snapshot and the bufdaemon trying to cleanup snapshot related buffers. This fix creates a temporary list containing the key metadata blocks that can cause the deadlock. This temporary list is used between the time that the snapshot is first enabled and the time that the fully complete list is built. Reported by: Attila Nagy <bra@fsn.hu> Sponsored by: DARPA & NAI Labs.
111239	22-Feb-2003	mckusick	This patch fixes a bug on an active filesystem on which a snapshot is being taken from panicing with either "freeing free block" or "freeing free inode". The problem arises when the snapshot code is scanning the filesystem looking for inodes with a reference count of zero (e.g., unlinked but still open) so that it can expunge them from its view. If it encounters a reclaimed vnode and has to restart its scan, then it will panic if it encounters and tries to free an inode that it has already processed. The fix is to check each candidate inode to see if it has already been processed before trying to delete it from the snapshot image. Sponsored by: DARPA & NAI Labs.
111238	22-Feb-2003	mckusick	This patch fixes a bug in the logical block calculation macros so that they convert to 64-bit values before shifting rather than afterwards. Once fixed, they can be used rather than inline expanded. Sponsored by: DARPA & NAI Labs.
111119	19-Feb-2003	imp	Back out M_* changes, per decision of the TRB. Approved by: trb
110885	14-Feb-2003	mckusick	Replace use of random() with arc4random() to provide less guessable values for the initial inode generation numbers in newfs and for newly allocated inode generation numbers in the kernel. Submitted by: Theo de Raadt <deraadt@cvs.openbsd.org> Sponsored by: DARPA & NAI Labs.
110837	14-Feb-2003	mckusick	Correct lines incorrectly added to the copyright message. Submitted by: Frank van der Linden <fvdl@wasabisystems.com> Sponsored by: DARPA & NAI Labs.
110584	09-Feb-2003	jeff	- Cleanup unlocked accesses to buf flags by introducing a new b_vflag member that is protected by the vnode lock. - Move B_SCANNED into b_vflags and call it BV_SCANNED. - Create a vop_stdfsync() modeled after spec's sync. - Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some fs specific processing. This gives all of these filesystems proper behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag. - Annotate the locking in buf.h
110234	02-Feb-2003	alfred	Catch more uses of MIN().
109623	21-Jan-2003	alfred	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
109153	13-Jan-2003	dillon	Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
109123	12-Jan-2003	dillon	Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it. Change struct xfile xf_data to xun_data (ABI is still compatible). If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
109053	10-Jan-2003	marcel	o Improve wording of the comment that accompanies fs_pad. The padding is not specific to non-i386 architectures. It is caused by non-i386 specific alignment requirements of fs_swuid, o Add a CTASSERT to catch a change in the size of struct fs at compile-time rather than run-time. Ok'd: gordon Tested on: i386 ia64
109034	09-Jan-2003	gordon	Fix superblock alignment problems on non-i386 platforms. Also change fs_uuid to fs_swuid, making it more descriptive. Submitted by: marcel Reviewed by: peter Pointy hat to: gordon
108970	08-Jan-2003	gordon	Steal some space from fs_fsmnt to create fs_volname and fs_uuid. The volname will be used to support volume names with the help of a GEOM module (to be committed). uuid will be used to deal with conflicting volume names (which doesn't work just yet). Approved by: mckusick@
108892	07-Jan-2003	mckusick	This patch fixes a problem caused by applications that rapidly and repeatedly truncate the same file. Each time the file is truncated, a buffer is grabbed to store the indirect block numbers that need to be freed. Those blocks cannot be freed until the inode claiming them is written to disk. Thus, the number of buffers being held by soft updates explodes and in extreme cases can run the kernel out of buffers. The problem can be avoided by doing an fsync on the file every debug.maxindirdep truncates (currently defaulted to 50). The fsync causes the inode to be written so that the held buffers can be freed. The check for excessive buffers is checked as part of the existing hook for excessive dependencies (softdep_slowdown) in the truncate code. Reported by: David Schultz <dschultz@uclink.Berkeley.EDU> Sponsored by: DARPA & NAI Labs. MFC after: 3 weeks
108686	04-Jan-2003	phk	Temporarily introduce a new VOP_SPECSTRATEGY operation while I try to sort out disk-io from file-io in the vm/buffer/filesystem space. The intent is to sort VOP_STRATEGY calls into those which operate on "real" vnodes and those which operate on VCHR vnodes. For the latter kind, the call will be changed to VOP_SPECSTRATEGY, possibly conditionally for those places where dual-use happens. Add a default VOP_SPECSTRATEGY method which will call the normal VOP_STRATEGY. First time it is called it will print debugging information. This will only happen if a normal vnode is passed to VOP_SPECSTRATEGY by mistake. Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY does on a VCHR vnode today. Add a new VOP_STRATEGY method in specfs to catch instances where the conversion to VOP_SPECSTRATEGY has not yet happened. Handle the request just like we always did, but first time called print debugging information. Apart up to two instances of console messages per boot, this amounts to a glorified no-op commit. If you get any of the messages on your console I would very much like a copy of them mailed to phk@freebsd.org
108648	04-Jan-2003	phk	Since Jeffr made the std* functions the default in rev 1.63 of kern/vfs_defaults.c it is wrong for the individual filesystems to use the std* functions as that prevents override of the default. Found by: src/tools/tools/vop_table
108589	03-Jan-2003	phk	Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since all BUF_STRATEGY did in the first place was call VOP_STRATEGY.
108533	01-Jan-2003	schweikh	Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup, especially in troff files.
108524	01-Jan-2003	alfred	When compiling the kernel do not implicitly include filedesc.h from proc.h, this was causing filedesc work to be very painful. In order to make this work split out sigio definitions to thier own header (sigio.h) which is included from proc.h for the time being.
108316	27-Dec-2002	phk	Use three UMA zones for FFS/UFS inodes instead of malloc space. Since inodes are currently 144 bytes, this will save 112 bytes per inode. This can amount to up to 10MByte on large systems.
108315	27-Dec-2002	phk	Move the allocation of the inode contents into ffs_vfsops.c rather than passing malloc types around.
108313	27-Dec-2002	phk	Make ffs_mountfs() static. Remove the malloctype from the ufs mount structure, instead add a callback to the storage method for freeing inodes: UFS_IFREE(). Add vfs_ifree() method function which frees an inode. Unvariablelize the malloc type used for allocating inodes.
108050	18-Dec-2002	mckusick	Fix corruption introduced in previous delta. Reported by: Aurelien Nephtali <aurelien.nephtali@wanadoo.fr> Sponsored by: DARPA & NAI Labs.
108017	18-Dec-2002	mckusick	Keep comments consistent with the code. Minor optimization. Sponsored by: DARPA & NAI Labs.
108010	18-Dec-2002	mckusick	Cosmetic cleanup of unsigned buglets. Submitted by: Bruce Evans <bde@zeta.org.au> Sponsored by: DARPA & NAI Labs.
107992	17-Dec-2002	phk	Remove unused lockcnt variable. Approved by: mckusick
107915	15-Dec-2002	mckusick	Update to previous change (1.54) to use an approperly wide inode field so as to work correctly on 64-bit platforms. Reported-by: Jake Burkholder <jake@locore.ca> Sponsored by: DARPA & NAI Labs. Approved by: Ian Dowse <iedowse@maths.tcd.ie>
107868	14-Dec-2002	iedowse	Undo the adjustment of the total memory used by dirhash in the case where allocating the dirhash structure fails. Fix a few typos in comments and update copyright. MFC after: 1 week
107848	14-Dec-2002	mckusick	Only the most recent snapshot contains the complete list of blocks that were copied in all of the earlier snapshots, thus its precomputed list must be used in the copyonwrite test. Using incomplete lists may lead to deadlock. Also do not include the blocks used for the indirect pointers in the indirect pointers as this may lead to inconsistent snapshots. Sponsored by: DARPA & NAI Labs. Approved by: re
107762	12-Dec-2002	trhodes	Remove the comment about dump(8) not working properly with snapshots. Discussed with: mckusick Approved by: re (rwatson)
107651	06-Dec-2002	mckusick	More tightly verify the preference returned for the new inode. Submitted by: Kris Kennaway <kris@obsecurity.org> Sponsored by: DARPA & NAI Labs. Approved by: re
107558	03-Dec-2002	mckusick	Have to use bread() rather than UFS_BALLOC() when obtaining a previously allocated block as the previous use of the block may have fallen out of the cache. Failure to reread its contents cause zeroed results to be written instead of the proper contents. Conversely, when the block is going to be entirely filled in, it is not necessary reread the old contents. Sponsored by: DARPA & NAI Labs. Approved by: re
107415	30-Nov-2002	mckusick	Add a check to disable the previous patch so that future filesystems that choose to place their superblocks in non-standard locations will not get them smashed. Sponsored by: DARPA & NAI Labs.
107414	30-Nov-2002	mckusick	Remove a race condition / deadlock from snapshots. When converting from individual vnode locks to the snapshot lock, be sure to pass any waiting processes along to the new lock as well. This transfer is done by a new function in the lock manager, transferlockers(from_lock, to_lock); Thanks to Lamont Granquist <lamont@scriptkiddie.org> for his help in pounding on snapshots beyond all reason and finding this deadlock. Sponsored by: DARPA & NAI Labs.
107406	30-Nov-2002	mckusick	Fix two deadlocks in snapshots: 1) Release the snapshot file lock while suspending the system. Otherwise a process trying to read the lock may block on its containing directory preventing the suspension from completing. Thanks to Sean Kelly <smkelly@zombie.org> for finding this deadlock. 2) Replace some bdwrite's with bawrite's so as not to fill all the buffers with dirty data. The buffers could not be cleaned as the snapshot vnode was locked hence the system could deadlock when making snapshots of really massive filesystems. Thanks to Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp> for figuring this out. Sponsored by: DARPA & NAI Labs.
107393	29-Nov-2002	mckusick	Check to make sure that the fs_sblockloc field was properly updated before using it to write the superblock. This is to guard against accidentally trashing the disklabel if the superblock format missed being upgraded by the new kernel. Reported by: Sam Leffler <sam@errno.com> Sponsored by: DARPA & NAI Labs. Approved by: Murray Stokely <murray@FreeBSD.org>
107294	27-Nov-2002	mckusick	Create a new 32-bit fs_flags word in the superblock. Add code to move the old 8-bit fs_old_flags to the new location the first time that the filesystem is mounted by a new kernel. One of the unused flags in fs_old_flags is used to indicate that the flags have been moved. Leave the fs_old_flags word intact so that it will work properly if used on an old kernel. Change the fs_sblockloc superblock location field to be in units of bytes instead of in units of filesystem fragments. The old units did not work properly when the fragment size exceeeded the superblock size (8192). Update old fs_sblockloc values at the same time that the flags are moved. Suggested by: BOUWSMA Barry <freebsd-misuser@netscum.dyndns.dk> Sponsored by: DARPA & NAI Labs.
107096	20-Nov-2002	mckusick	The target for the maximum number of dependencies has been cut in half because of reports that under heavy load the kernel could exhaust its memory pool. The limit is now (desiredvnodes * 4) rather than (desiredvnodes * 8), so it will still scale with larger systems, just not as quickly. Sponsored by: DARPA & NAI Labs.
107095	20-Nov-2002	mckusick	If an error occurs while writing a buffer, then the data will not have hit the disk and the dependencies cannot be unrolled. In this case, the system will mark the buffer as dirty again so that the write can be retried in the future. When the write succeeds or the system gives up on the buffer and marks it as invalid (B_INVAL), the dependencies will be cleared. Sponsored by: DARPA & NAI Labs.
106965	15-Nov-2002	peter	Do not assume that time_t is an int. Approved by: re (jhb)
106673	08-Nov-2002	jhb	Print daddr_t's with %j and intmax_t.
106394	04-Nov-2002	rwatson	Update licenses and wording: NAI has authorized the removal of clause three of their BSD-style license; also, carry out the NAI Labs -> Network Associates Laboratories renaming in these files.
106058	27-Oct-2002	wollman	Implement the new 1003.1-2001 pathconf() keys, including the Advisory Information option. Other filesystem implementations should do something similar. With advice from: mckusick, phk
105988	26-Oct-2002	rwatson	Slightly change the semantics of vnode labels for MAC: rather than "refreshing" the label on the vnode before use, just get the label right from inception. For single-label file systems, set the label in the generic VFS getnewvnode() code; for multi-label file systems, leave the labeling up to the file system. With UFS1/2, this means reading the extended attribute during vfs_vget() as the inode is pulled off disk, rather than hitting the extended attributes frequently during operations later, improving performance. This also corrects sematics for shared vnode locks, which were not previously present in the system. This chances the cache coherrency properties WRT out-of-band access to label data, but in an acceptable form. With UFS1, there is a small race condition during automatic extended attribute start -- this is not present with UFS2, and occurs because EAs aren't available at vnode inception. We'll introduce a work around for this shortly. Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
105902	25-Oct-2002	mckusick	Within ufs, the ffs_sync and ffs_fsync functions did not always check for and/or report I/O errors. The result is that a VFS_SYNC or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in the presence of a hard error writing a disk sector or in a filesystem full condition. This patch ensures that I/O errors will always be checked and returned. This patch also ensures that every call to VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes appropriate action when an error is returned. Sponsored by: DARPA & NAI Labs.
105823	23-Oct-2002	mckusick	We must be careful to avoid recursive copy-on-write faults when trying to clean up during disk-full senarios. Sponsored by: DARPA & NAI Labs.
105763	23-Oct-2002	mckusick	Missplaced FREE_LOCK causes a panic when hit while taking a snapshot. Sponsored by: DARPA & NAI Labs.
105670	22-Oct-2002	mckusick	This update further fine tunes the locking of snapshot vnodes in the ffs_copyonwrite routine to avoid a deadlock between the syncer daemon trying to sync out a snapshot vnode and the bufdaemon trying to write out a buffer containing the snapshot inode. With any luck this will be the last snapshot race condition. Sponsored by: DARPA & NAI Labs.
105669	22-Oct-2002	mckusick	This update is a performance improvement when allocating blocks on a full filesystem. Previously, if the allocation failed, we had to fsync the file before rolling back any partial allocation of indirect blocks. Most block allocation requests only need to allocate a single data block and if that allocation fails, there is nothing to unroll. So, before doing the fsync, we check to see if any rollback will really be necessary. If none is necessary, then we simply return. This update eliminates the flurry of disk activity that got triggered whenever a filesystem would run out of space. Sponsored by: DARPA & NAI Labs.
105667	22-Oct-2002	mckusick	This checkin reimplements the io-request priority hack in a way that works in the new threaded kernel. It was commented out of the disksort routine earlier this year for the reasons given in kern/subr_disklabel.c (which is where this code used to reside before it moved to kern/subr_disk.c): ---------------------------- revision 1.65 date: 2002/04/22 06:53:20; author: phk; state: Exp; lines: +5 -0 Comment out Kirks io-request priority hack until we can do this in a civilized way which doesn't cause grief. The problem is that it is not generally safe to cast a "struct bio " to a "struct buf ". Things like ccd, vinum, ata-raid and GEOM constructs bio's which are not entrails of a struct buf. Also, curthread may or may not have anything to do with the I/O request at hand. The correct solution can either be to tag struct bio's with a priority derived from the requesting threads nice and have disksort act on this field, this wouldn't address the "silly-seek syndrome" where two equal processes bang the diskheads from one edge to the other of the disk repeatedly. Alternatively, and probably better: a sleep should be introduced either at the time the I/O is requested or at the time it is completed where we can be sure to sleep in the right thread. The sleep also needs to be in constant timeunits, 1/hz can be practicaly any sub-second size, at high HZ the current code practically doesn't do anything. ---------------------------- As suggested in this comment, it is no longer located in the disk sort routine, but rather now resides in spec_strategy where the disk operations are being queued by the thread that is associated with the process that is really requesting the I/O. At that point, the disk queues are not visible, so the I/O for positively niced processes is always slowed down whether or not there is other activity on the disk. On the issue of scaling HZ, I believe that the current scheme is better than using a fixed quantum of time. As machines and I/O subsystems get faster, the resolution on the clock also rises. So, ten years from now we will be slowing things down for shorter periods of time, but the proportional effect on the system will be about the same as it is today. So, I view this as a feature rather than a drawback. Hence this patch sticks with using HZ. Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@critter.freebsd.dk>
105572	20-Oct-2002	rwatson	Rename _POSIX_FOO_PRESENT and friends from POSIX.1e to _PC_FOO_PRESENT and related friends. This would have been corrected had POSIX.1e progressed to a standard. Pointed out by: wollman
105571	20-Oct-2002	rwatson	Implement _POSIX_ACL_PATH_MAX, which returns the maximum number of ACL entries for a file system node using pathconf(). Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
105567	20-Oct-2002	rwatson	Teach UFS to respond to pathconf() tests for _POSIX_ACL_EXTENDED and _POSIX_MAC_PRESENT based on available mount flags, if the services are available. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
105456	19-Oct-2002	rwatson	Clarify that the UFS1 extended attribute configuration steps do not apply to UFS2 file systems. Submitted by: jedgar Obtained from: TrustedBSD Project
105422	18-Oct-2002	dillon	Fix a file-rewrite performance case for UFS[2]. When rewriting portions of a file in chunks that are less then the filesystem block size, if the data is not already cached the system will perform a read-before-write. The problem is that it does this on a block-by-block basis, breaking up the I/Os and making clustering impossible for the writes. Programs such as INN using cyclic file buffers suffer greatly. This problem is only going to get worse as we use larger and larger filesystem block sizes. The solution is to extend the sequential heuristic so UFS[2] can perform a far larger read and readahead when dealing with this case. (note: maximum disk write bandwidth is 27MB/sec thru filesystem) (note: filesystem blocksize in test is 8K (1K frag)) dd if=/dev/zero of=test.dat bs=1k count=2m conv=notrunc Before: (note half of these are reads) tty da0 da1 acd0 cpu tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id 0 76 14.21 598 8.30 0.00 0 0.00 0.00 0 0.00 0 0 7 1 92 0 76 14.09 813 11.19 0.00 0 0.00 0.00 0 0.00 0 0 9 5 86 0 76 14.28 821 11.45 0.00 0 0.00 0.00 0 0.00 0 0 8 1 91 After: (note half of these are reads) tty da0 da1 acd0 cpu tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id 0 76 63.62 434 26.99 0.00 0 0.00 0.00 0 0.00 0 0 18 1 80 0 76 63.58 424 26.30 0.00 0 0.00 0.00 0 0.00 0 0 17 2 82 0 76 63.82 438 27.32 0.00 0 0.00 0.00 0 0.00 1 0 19 2 79 Reviewed by: mckusick Approved by: re X-MFC after: immediately (was heavily tested in -stable for 4 months)
105417	18-Oct-2002	rwatson	Update extended attribute readme file to note that no special configuration is required to use EAs with UFS2, and that UFS2 is recommend for EA use for a variety of reasons. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
105416	18-Oct-2002	rwatson	Update instructions for ACLs given recent tunefs, mount changes. Also note that UFS2 doesn't require explicit extended attribute configuration, and is recommends for this and other reasons if you plan to use ACLs. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
105415	18-Oct-2002	rwatson	Use 'size_t' instead of 'int' for the result of sizeof().
105368	18-Oct-2002	mckusick	With the revised single-lock method used in snapshots, the BA_NOWAIT flag is no longer needed. Sponsored by: DARPA & NAI Labs.
105191	16-Oct-2002	mckusick	Change locking so that all snapshots on a particular filesystem share a common lock. This change avoids a deadlock between snapshots when separate requests cause them to deadlock checking each other for a need to copy blocks that are close enough together that they fall into the same indirect block. Although I had anticipated a slowdown from contention for the single lock, my filesystem benchmarks show no measurable change in throughput on a uniprocessor system with three active snapshots. I conjecture that this result is because every copy-on-write fault must check all the active snapshots, so the process was inherently serial already. This change removes the last of the deadlocks of which I am aware in snapshots. Sponsored by: DARPA & NAI Labs.
105179	15-Oct-2002	rwatson	Push most UFS ACL behavior behind a check for MNT_ACLS, permitting ACLs to be administratively disabled as needed on UFS/UFS2 file systems. This also has the effect of preventing the slightly more expensive ACL code from running on non-ACL file systems, avoiding storage allocation for ACLs that may be read from disk. MNT_ACLS may be set at mount-time using mount -o acls, or implicitly by setting the FS_ACLS flag using tunefs. On UFS1, you may also have to configure ACL store. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
105169	15-Oct-2002	rwatson	If the FS_MULTILABEL flag is set in a UFS or UFS2 superblock, automatically set MNT_MULTILABEL in the mount flags. If FS_ACLS is set in a UFS or UFS2 superblock, automatically set MNT_ACLS in the mount flags. If either of these flags is set, but the appropriate kernel option to support the features associated with the flag isn't available, then print a warning at mount-time. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
105136	14-Oct-2002	mckusick	When reading or writing the extended attributes of a special device or fifo in UFS2, the normal ufs_strategy routine needs to be used rather than the spec_strategy or fifo_strategy routine. Thus the ffsext_strategy routine is interposed in the ffs_vnops vectors for special devices and fifo's to pick off this special case. Otherwise it simply falls through to the usual spec_strategy or fifo_strategy routine. Submitted by: Robert Watson <rwatson@FreeBSD.org> Sponsored by: DARPA & NAI Labs.
105123	14-Oct-2002	rwatson	Fix two memory leaks in error conditions involving the UFS ACL code: if failures occur, make sure that we release both the default ACL and access ACL storage during new object creation. Spotted by: phk and his pet flexelint Sponsored by: DARPA, Network Associates Laboratories
105112	14-Oct-2002	rwatson	Define two new superblock file system flags: FS_ACLS Administrative enable/disable of extended ACL support FS_MULTILABEL Administrative flag to indicate to the MAC Framework that objects in the file system are individually labeled using extended attributes. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories Reviewed by: (in principal) mckusick, phk
105077	14-Oct-2002	mckusick	Regularize the vop_stdlock'ing protocol across all the filesystems that use it. Specifically, vop_stdlock uses the lock pointed to by vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to reference vp->v_lock. Filesystems that wish to use the default do not need to allocate a lock at the front of their node structure (as some still did) or do a lockinit. They can simply start using vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks, but still use the vop_stdlock functions (such as nullfs) can simply replace vp->v_vnlock with a pointer to the lock that they wish to have used for the vnode. Such filesystems are responsible for setting the vp->v_vnlock back to the default in their vop_reclaim routine (e.g., vp->v_vnlock = &vp->v_lock). In theory, this set of changes cleans up the existing filesystem lock interface and should have no function change to the existing locking scheme. Sponsored by: DARPA & NAI Labs.
104908	11-Oct-2002	mike	Change iov_base's type from `char ' to the standard `void '. All uses of iov_base which assume its type is `char ' (in order to do pointer arithmetic) have been updated to cast iov_base to `char '.
104716	09-Oct-2002	mux	Fix build of 64 bit platforms.
104702	09-Oct-2002	mckusick	When creating a snapshot, create a list of initially allocated blocks. Whenever doing a copy-on-write check, first look in the list of initially allocated blocks to see if it is there. If so, no further check is needed. If not, fall through and do the full check. This change eliminates one of two known deadlocks caused by snapshots. Handling the second deadlock will be the subject of another check-in. This change also reduces the cost of the copy-on-write check by speeding up the verification of frequently checked blocks. Sponsored by: DARPA & NAI Labs.
104698	09-Oct-2002	mckusick	When creating a snapshot, create a list of initially allocated blocks. Whenever doing a copy-on-write check, first look in the list of initially allocated blocks to see if it is there. If so, no further check is needed. If not, fall through and do the full check. This change eliminates one of two known deadlocks caused by snapshots. Handling the second deadlock will be the subject of another check-in. This change also reduces the cost of the copy-on-write check by speeding up the verification of frequently checked blocks. Sponsored by: DARPA & NAI Labs.
104697	09-Oct-2002	mckusick	The appropriate units for disk block addresses are always DEV_BSIZE, even when the underlying device has a larger sector size. Therefore, the filesystem code should not (and with this patch does not) try to use the underlying sector size when doing disk block address calculations. This patch fixes problems in -current when using the swap-based memory-disk device (mdconfig -a -t swap ...). This bugfix is not relevant to -stable as -stable does not have the memory-disk device. Sponsored by: DARPA & NAI Labs.
104688	08-Oct-2002	jeff	- Remove LK_INTERLOCK from the vn_lock() in ffs_snapshot(). Pointy hat to: me Found by: green
104364	02-Oct-2002	phk	Mark two places where an unsigned number is checked "if (foo < 0)" with an XXX comment. Somebody[TM] should look at this in some detail. Spotted by: FlexeLint
104346	02-Oct-2002	dd	size_t is not a struct (fix mislabelling in a comment).
104302	01-Oct-2002	phk	Fix some harmless mis-indents. Spotted by: FlexeLint
104104	28-Sep-2002	jmallett	When spamming me with a printf(9), under DIAGNOSTIC, at least be nice enough to include a newline. MFC after: 4 days Sponsored by: Bright Path Solutions
104094	28-Sep-2002	phk	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512
104052	27-Sep-2002	phk	Make it a tad easier to deal with struct inode in userland programs which fondle /dev/kmem by using "struct cdev *" instead of "dev_t". Requsted by: jake
104051	27-Sep-2002	phk	Use our mount-credential if we get a NOCRED when we try to write out EA space back to disk. This is wrong in many ways, but not as wrong as a panic. Pancied on: rwatson & jmallet Sponsored by: DARPA & NAI Labs.
103946	25-Sep-2002	jeff	- Convert locks to use standard macros. - Lock access to the buflists. - Document broken locking. - Use vrefcnt().
103945	25-Sep-2002	jeff	- Document broken locking. - Use vrefcnt().
103944	25-Sep-2002	jeff	- Lock accesses to v_usecount. - Convert interlock locks to use standard macros.
103943	25-Sep-2002	jeff	- Don't use the interlock to protect v_writecount.
103690	20-Sep-2002	phk	We don't need to #include <sys/disklabel.h>. We don't need to #include <sys/disklabel.h> second time either. Sponsored by: DARPA & NAI Labs.
103636	19-Sep-2002	truckman	VOP_FSYNC() requires that it's vnode argument be locked, which nfs_link() wasn't doing. Rather than just lock and unlock the vnode around the call to VOP_FSYNC(), implement rwatson's suggestion to lock the file vnode in kern_link() before calling VOP_LINK(), since the other filesystems also locked the file vnode right away in their link methods. Remove the locking and and unlocking from the leaf filesystem link methods. Reviewed by: rwatson, bde (except for the unionfs_link() changes)
103594	19-Sep-2002	obrien	intmax_t is printed with %jd, not %lld.
103559	18-Sep-2002	njl	Remove any VOP_PRINT that redundantly prints the tag. Move lockmgr_printinfo() into vprint() for everyone's benefit. Suggested by: bde
103314	14-Sep-2002	njl	Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging. Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP. Suggested by: phk Reviewed by: bde, rwatson (earlier version)
103180	10-Sep-2002	bde	vfs_syscalls.c: Changed rename(2) to follow the letter of the POSIX spec. POSIX requires rename() to have no effect if its args "resolve to the same existing file". I think "file" can only reasonably be read as referring to the inode, although the rationale and "resolve" seem to say that sameness is at the level of (resolved) directory entries. ext2fs_vnops.c, ufs_vnops.c: Replaced code that gave the historical BSD behaviour of removing one link name by checks that this code is now unreachable. This fixes some races. All vnodes needed to be unlocked for the removal, and locking at another level using something like IN_RENAME was not even attempted, so it was possible for rename(x, y) to return with both x and y removed even without any unlink(2) syscalls (one process can remove x using rename(x, y) and another process can remove y using rename(y, x)). Prodded by: alfred MFC after: 8 weeks PR: 42617
102991	05-Sep-2002	phk	Implement the VOP_OPENEXTATTR() and VOP_CLOSEEXTATTR() methods. Use extattr_check_cred() to check access to EAs. This is still a WIP. Sponsored by: DARPA & NAI Labs.
102988	05-Sep-2002	phk	Use canonical extattr_check_cred() instead of private implementation of the same policy. Sponsored by: DARPA & NAI Labs.
102985	05-Sep-2002	phk	Fix credentials check: do not leak ENOATTR until we know if they're supposed to know. Sponsored by: DARPA & NAI Labs.
102957	05-Sep-2002	bde	Include <sys/malloc.h> instead of depending on namespace pollution 2 layers deep in <sys/proc.h> or <sys/vnode.h>. Include <sys/vmmeter.h> instead of depending on namespace pollution in <sys/pcpu.h>. Sorted includes as much as possible.
102774	01-Sep-2002	rwatson	Since we have vp and td cached in local variables, use those instead of derefencing the VOP arguments again when calling the UFS code. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
102608	30-Aug-2002	phk	Correctly handle setting, getting and deleting EA's with zero length content. Sponsored by: DARPA & NAI Labs.
102412	25-Aug-2002	charnier	Replace various spelling with FALLTHROUGH which is lint()able
102382	25-Aug-2002	alc	o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since pmap_zero_page() and pmap_zero_page_area() were modified to accept a struct vm_page * instead of a physical address, vm_page_zero_fill() and vm_page_zero_fill_area() have served no purpose.
102175	20-Aug-2002	phk	Implement list of EA return functionality. Correctly delete EA's when the content length is set to zero. Sponsored by: DARPA & NAI Labs.
102090	19-Aug-2002	phk	First snapshot of UFS2 EA support. Sponsored by: DARPA & NAI Labs.
101941	15-Aug-2002	rwatson	In order to better support flexible and extensible access control, make a series of modifications to the credential arguments relating to file read and write operations to cliarfy which credential is used for what: - Change fo_read() and fo_write() to accept "active_cred" instead of "cred", and change the semantics of consumers of fo_read() and fo_write() to pass the active credential of the thread requesting an operation rather than the cached file cred. The cached file cred is still available in fo_read() and fo_write() consumers via fp->f_cred. These changes largely in sys_generic.c. For each implementation of fo_read() and fo_write(), update cred usage to reflect this change and maintain current semantics: - badfo_readwrite() unchanged - kqueue_read/write() unchanged pipe_read/write() now authorize MAC using active_cred rather than td->td_ucred - soo_read/write() unchanged - vn_read/write() now authorize MAC using active_cred but VOP_READ/WRITE() with fp->f_cred Modify vn_rdwr() to accept two credential arguments instead of a single credential: active_cred and file_cred. Use active_cred for MAC authorization, and select a credential for use in VOP_READ/WRITE() based on whether file_cred is NULL or not. If file_cred is provided, authorize the VOP using that cred, otherwise the active credential, matching current semantics. Modify current vn_rdwr() consumers to pass a file_cred if used in the context of a struct file, and to always pass active_cred. When vn_rdwr() is used without a file_cred, pass NOCRED. These changes should maintain current semantics for read/write, but avoid a redundant passing of fp->f_cred, as well as making it more clear what the origin of each credential is in file descriptor read/write operations. Follow-up commits will make similar changes to other file descriptor operations, and modify the MAC framework to pass both credentials to MAC policy modules so they can implement either semantic for revocation. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
101789	13-Aug-2002	phk	Expand the arguments to ffs_ext{read,write}() to their component parts rather than use vop_{read,write}_args. Access to these functions will ultimately not be available through the "vop_{read,write}+IO_EXT" API but this functionality is retained for debugging purposes for now. Sponsored by: DARPA & NAI Labs.
101780	13-Aug-2002	phk	Unravel the UFS_EXTATTR incest between FFS and UFS: UFS_EXTATTR is an UFS only thing, and FFS should in principle not know if it is enabled or not. This commit cleans ffs_vnops.c for such knowledge, but not ffs_vfsops.c Sponsored by: DARPA and NAI Labs.
101777	13-Aug-2002	phk	Introduce typedefs for the member functions of struct vfsops and employ these in the main filesystems. This does not change the resulting code but makes the source a little bit more grepable. Sponsored by: DARPA and NAI Labs.
101744	12-Aug-2002	rwatson	Pass IO_NOMACCHECK to vn_rdwr() in the following checks to prevent enforcement of MAC policy on the read or write operations: - In ext2fs, don't enforce MAC on loop-back reads and writes supporting directory read operations in lookup(), directory modifications in rename(), directory write operations in mkdir(), symlink write operations in symlink(). - In the NFS client locking code, perform vn_rdwr() on the NFS locking socket without enforcing MAC, since the write is done on behalf of the kernel NFS implementation rather than the user process. - In UFS, don't enforce MAC on loop-back reads and writes supporting directory read operations in lookup(), and symlink write operations in symlink(). Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
101720	12-Aug-2002	phk	Stop pretending that the FFS file ufs_readwrite.c is a UFS file. Instead of #including it, pull it into ffs_vnops.c and name things correctly. Sponsored by: DARPA & NAI Labs.
101717	12-Aug-2002	phk	Fix a comment.
101398	05-Aug-2002	iedowse	Don't call softdep_slowdown() if soft updates are not active on the filesystem. This causes a panic for kernels compiled without softupdates. Reported by: luigi
101308	04-Aug-2002	jeff	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
101073	31-Jul-2002	rwatson	Introduce support for Mandatory Access Control and extensible kernel access control. Instrument UFS to support per-inode MAC labels. In particular, invoke MAC framework entry points for generically supporting the backing of MAC labels into extended attributes. This ends up introducing new vnode operation vector entries point at the MAC framework entry points, as well as some explicit entry point invocations for file and directory creation events so that the MAC framework can push labels to disk before the directory names become persistent (this will work better once EAs in UFS2 are hooked into soft updates). The generic EA MAC entry points support executing with the file system in either single label or multilabel operation, and will fall back to the mount label if multilabel is not specified at mount-time. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
101018	31-Jul-2002	phk	I forgot this bit of uglyness in the fsck_ffs cleanup.
100926	30-Jul-2002	phk	Fix braino in last commit.
100925	30-Jul-2002	phk	Move ffs_isfreeblock() to ffs_alloc.c and make it static. Sponsored by: DARPA & NAI Labs.
100807	28-Jul-2002	alc	Lock page queue accesses by vm_page_free().
100393	20-Jul-2002	benno	Add a missing argument to the stub for softdep_setup_freeblocks. Forgotten by: mckusick
100382	20-Jul-2002	peter	Fix a warning: ffs_softdep.c:1630: warning: int format, different type arg (arg 2)
100344	19-Jul-2002	mckusick	Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word is reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags \|= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, lets call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended atribute storage. Sponsored by: DARPA & NAI Labs.
100207	17-Jul-2002	mckusick	Change utimes to set the file creation time (for filesystems that support creation times such as UFS2) to the value of the modification time if the value of the modification time is older than the current creation time. See utimes(2) for further details. Sponsored by: DARPA & NAI Labs.
100201	16-Jul-2002	mckusick	Change the name of st_createtime to st_birthtime. This change is made to reduce confusion between st_ctime and st_createtime. Submitted by: Eric Allman <eric@sendmail.org> Sponsored by: DARPA & NAI Labs.
99888	12-Jul-2002	trhodes	Fix a type: s/your are/you are/
99590	08-Jul-2002	bde	Fixed some printf format errors (4 new ones reported by gcc and 5 nearby old ones not reported by gcc). This helps unbreak LINT.
99220	01-Jul-2002	iedowse	Use indirect function pointer hooks instead of #ifdef SOFTUPDATES direct calls for the two places where the kernel calls into soft updates code. Set up the hooks in softdep_initialize() and NULL them out in softdep_uninitialize(). This change allows soft updates to function correctly when ufs is loaded as a module. Reviewed by: mckusick
99206	01-Jul-2002	iedowse	Add the ffs bits necessary to support unloading of the ufs kernel module. This adds an ffs_uninit() function that calls ufs_uninit() and also calls a new softdep_uninitialize() function. Add a stub for softdep_uninitialize() to cover the non-SOFTUPDATES case. Reviewed by: mckusick
99101	30-Jun-2002	iedowse	Remove the bogus SYSINIT from ufs_dirhash.c and instead add a call to ufsdirhash_init() from ufs_init(). Add uninit() functions corresponding the ufs, dirhash, quota and ihash init() functions.
98888	26-Jun-2002	iedowse	Remove the kernel file-size limit for UFS2, so that only the limit imposed by the filesystem structure itself remains. With 16k blocks, the maximum file size is now just over 128TB. For now, the UFS1 file size limit is left unchanged so as to remain consistent with RELENG_4, but it too could be removed in the future. Reviewed by: mckusick
98849	26-Jun-2002	ken	At long last, commit the zero copy sockets code. MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes. ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls. man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links. jumbo.9: New man page describing the jumbo buffer allocator interface and operation. zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality. NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT. conf/files: Add uipc_jumbo.c and uipc_cow.c. conf/options: Add the 5 options mentioned above. kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1. uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack. uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive. uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on. Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions. uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. (uipc_cow.c) if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails. The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)). ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives. if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers. Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface. Add header splitting support to the ti(4) driver. Tweak some of the default interrupt coalescing parameters to more useful defaults. Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off. if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13. Add defines needed for debugging. Remove the ti_stats structure, it is now defined in sys/tiio.h. ti_fw.h: 12.4.11 firmware. ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.) sys/jumbo.h: Jumbo buffer allocator interface. sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process. socketvar.h: Add prototype for socow_setup. tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions. uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable. ufs_readwrite.c:Update for new prototype of uiomoveco(). vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault. vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object allocate does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structre. This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.) vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK. vm_object.h: Add prototype for vm_object_allocate_wait(). vm_page.c: Add page-based copy on write setup, clear and fault routines. vm_page.h: Add page based COW function prototypes and variable in the vm_page structure. Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.
98788	25-Jun-2002	mckusick	Force the quota update to be done when an inode is released in ufs_inactive. This avoid a panic when checking a NULL credential in suser_cred().
98770	24-Jun-2002	jlemon	Prototype fixes (long newinum --> ino_t newinum).
98687	23-Jun-2002	mux	Warning fixes for 64 bits platforms. This eliminates all the warnings I have had in the FFS code on sparc64. Reviewed by: mckusick
98658	23-Jun-2002	dillon	Rename the BALLOC flags from B_* to BA_* to avoid confusion with the struct buf B_ flags. Approved by: mckusick
98640	22-Jun-2002	mckusick	This patch fixes a problem whereby filesystems that ran out of inodes in a cylinder group would fail to check for free inodes in other cylinder groups. This bug was introduced in the UFS2 code merge two days ago. An inode is allocated by calling ffs_valloc which calls ffs_hashalloc to do the filesystem scan. Ffs_hashalloc walks around the cylinder groups calling its passed allocator (ffs_nodealloccg in this case) until the allocator returns a non-zero result. The bug is that ffs_hashalloc expects the passed allocator function to return a 64-bit ufs2_daddr_t. When allocating inodes, it calls ffs_nodealloccg which was returning a 32-bit ino_t. The ffs_hashalloc code checked a 64-bit return value and usually found random non-zero bits in the high 32-bits so decided that the allocation had succeeded (in this case in the only cylinder group that it checked). When the result was passed back to ffs_valloc it looked at only the bottom 32-bits, saw zero and declared the system out of inodes. But ffs_hashalloc had really only checked one cylinder group. The fix is to change ffs_nodealloccg to return 64-bit results. Sponsored by: DARPA & NAI Labs. Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk> Reviewed by: Maxime Henrion <mux@freebsd.org>
98542	21-Jun-2002	mckusick	This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
98425	19-Jun-2002	dillon	In rev 1.72 a situation related to write/mmap was fixed which could result in a user process gaining visibility into the 'old' contents of a filesystem block. There were two cases: (1) when uiomove() fails (user process issues illegal write), and (2) when uiomove() overlaps a mmap() of the same file at the same offset (fault -> recursive buffer I/O reads contents of old block). Unfortunately 1.72 also had the unintended effect of forcing the filesystem to do a read-before-write in the case of a full-block-write (non append case), e.g. 'dd if=/dev/zero of=test.dat bs=1m count=256 conv=notrunc'. This destroys performance.. not only is a read forced for every write, but clustering breaks as well. The solution is to clear the buffer manually in the full-block case rather then asking BALLOC to do it (BALLOC issues the read-before-write). In the partial-block case we want BALLOC to do it because the read-before-write is necessary. This patch should greatly improve database and news-feed server performance. Found by: MKI <mki@mozone.net> MFC after: 3 days
97962	06-Jun-2002	semenu	Fix a typo in my recently added comment: s/beleived/believed/ Submitted by: keramida
97724	01-Jun-2002	alfred	Backout/modify previous revision: "empty default cases shouldn't be removed, they should have a break; statement added to them." Requested by: billf
97723	01-Jun-2002	alfred	Silence warnings, remove some empty 'default' switch cases.
97640	30-May-2002	semenu	Remove lock from ffs_vget introduced by v1.24. Instead of locking the vnode creation globaly, we allow processes to create vnodes concurently. In case of concurent creation of vnode for the one ino, we allow processes to race and then check who wins. Assuming that concurent creation of vnode for same ino is really rare case, this is belived to be an improvement, as it just allows concurent creation of vnodes. Idea by: bp Reviewed by: dillon MFC after: 1 month
96885	19-May-2002	rwatson	Remove IFS from 5.0-CURRENT. This facilitates introducing UFS2 as IFS had its fingers deep in the belly of the UFS/FFS split. IFS will be reimplemented by the maintainer at a later date. Requested by: adrian (maintainer)
96876	18-May-2002	iedowse	Fix two casts to "daddr_t " that should have been "ufs_daddr_t ".
96874	18-May-2002	iedowse	Fix a typo where sizeof(daddr_t) was specified instead of sizeof(doff_t). Now that daddr_t is 64-bit, this caused hash blocks to be allocated twice as large as they need to be.
96873	18-May-2002	iedowse	Remove um_i_effnlink_valid, i_spare[] and the ufsmount_u and inode_u unions, since these were only necessary when ext2fs used ufs code. Reviewed by: mckusick
96821	17-May-2002	phk	Fix ufs_daddr_t/daddr_t type problems. Sponsored by: DARPA & NAI labs.
96820	17-May-2002	phk	Call ufs_bmaparray() with right parameter type. Sponsored by: DARPA & NAI Labs.
96755	16-May-2002	trhodes	More s/file system/filesystem/g
96572	14-May-2002	phk	Make daddr_t and u_daddr_t 64bits wide. Retire daddr64_t and use daddr_t instead. Sponsored by: DARPA & NAI Labs.
96506	13-May-2002	phk	Remove register keyword. Sponsored by: DARPA & NAI Labs. Submitted by: mckusick
96482	12-May-2002	phk	Remove two "register" and a blank line. Submitted by: mckusick Sponsored by: DARPA & NAI Labs.
96473	12-May-2002	phk	ARGH! SBLOCK is not unused. Try to get this right. BBSIZE belongs in <sys/disklabel.h> (but shouldn't be a constant). Define SBLOCK again, using the right math. Sponsored by: DARPA & NAI Labs.
96472	12-May-2002	phk	Remove #define for BBOFF, it is assumed == 0 so many places that we might as well forget about it. In fact the only thing which used it was the SBOFF macro. Sponsored by: DARPA & NAI Labs.
96471	12-May-2002	phk	Remove unused BBLOCK and SBLOCK #defines. Sponsored by: DARPA & NAI Labs.
96095	06-May-2002	alc	o Condition the compilation and use of vm_freeze_copyopts() on ENABLE_VFS_IOOPT.
96072	05-May-2002	phk	Move some UFS related stuff home where it belongs.
96010	04-May-2002	jeff	Include systm.h so panic(9) is defined when doing DEBUG_ALL_VFS_LOCKS.
95974	03-May-2002	phk	Name ufs_vop_[gs]etextattr() consistently with the rest of our VOPs and put then in the ufs_vnops where they belong, rather than in the ffs_vnops. Ok'ed by: rwatson Sponsored by: DARPA & NAI Labs.
95945	02-May-2002	phk	Use vop_panic() instead of our home-rolled version.
94996	18-Apr-2002	alfred	Remove support for using soon to be retired "special" poll(2) ops. Replace with kevent(2) ops. This is untested, but the code would rot even further if this wasn't applied. I've chosen to apply this to prompt some cleanup. Submitted by: bde
94723	15-Apr-2002	jeff	Don't peak into the malloc_type structure for limits. The desired vnodes check should be sufficient. This is required for the pending removal of malloc_type limits.
94182	08-Apr-2002	phk	Move generic disk ioctls from <sys/disklabel.h> to <sys/disk.h>. Sponsored by: DARPA & NAI Labs
93818	04-Apr-2002	jhb	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
93736	03-Apr-2002	phk	Move the FFS parameter MAXFRAG from <sys/param.h> to <ufs/ffs/fs.h> Sponsored by: DARPA & NAI Labs.
93654	02-Apr-2002	phk	Use DIOCGSECTORSIZE instead of the bogus DIOCGPART ioctl.
93593	01-Apr-2002	jhb	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
93430	30-Mar-2002	bde	In ffs_mountffs(), set mnt_iosize_max to si_iosize_max unconditionally provided the latter is nonzero. At this point, the former is a fairly arbitrary default value (DFTPHYS), so changing it to any reasonable value specified by the device driver is safe. Using the maximum of these limits broke ffs clustered i/o for devices whose si_iosize_max is < DFLTPHYS. Using the minimum would break device drivers' ability to increase the active limit from DFTLPHYS up to MAXPHYS. Copied the code for this and the associated (unnecessary?) fixup of mp_iosize_max to all other filesystems that use clustering (ext2fs and msdosfs). It was completely missing. PR: 36309 MFC-after: 1 week
92807	20-Mar-2002	dwmalone	Two minor changes to dirhash, which result in some marginal benchmark improvements. 1) If deleting an entry results in a chain of deleted slots ending in an empty slot, then we can be a bit more aggressive about marking slots as empty. 2) The last stage of the FNV hash is to xor the last byte of data into the hash. This means that filenames which differ only in the last byte will be placed close to one another in the hash table, which forms longer chains. To work around this common case, we also hash in the address of the dirhash structure. news/cancel = news/articles/control/cancel for a tradspool inn server squid2 = squid level 2 directory (dirs called 00->FF) squid3 = squid level 3 directory (files called 00001F00->00001FFF) mean #probes for home dir mh inbox news/cancel tmp squid2 squid3 old successful 1.02 3.19 4.07 1.10 7.85 2.06 new successful 1.04 1.32 1.27 1.04 1.93 1.17 old unsuccessful 1.08 4.50 5.37 1.17 10.76 2.69 new unsuccessful 1.08 1.73 1.64 1.17 2.89 1.37 Reviewed by: iedowse MFC after: 2 weeks
92768	20-Mar-2002	jeff	Remove references to vm_zone.h and switch over to the new uma API.
92728	19-Mar-2002	alfred	Remove __P.
92640	19-Mar-2002	bde	Fixed some printf format errors (hopefully all of the remaining daddr64_t ones for GENERIC, and all others on the same line as those). Reformat the printfs if necessary to avoid new long lones or old format printf errors.
92462	17-Mar-2002	mckusick	Add a flags parameter to VFS_VGET to pass through the desired locking flags when acquiring a vnode. The immediate purpose is to allow polling lock requests (LK_NOWAIT) needed by soft updates to avoid deadlock when enlisting other processes to help with the background cleanup. For the future it will allow the use of shared locks for read access to vnodes. This change touches a lot of files as it affects most filesystems within the system. It has been well tested on FFS, loopback, and CD-ROM filesystems. only lightly on the others, so if you find a problem there, please let me (mckusick@mckusick.com) know.
92363	15-Mar-2002	mckusick	Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.
92299	15-Mar-2002	obrien	Quiet a warning on the Alpha.
92250	14-Mar-2002	mckusick	This corrects the first of two known deadlock conditions that come from the presence of a snapshot file.
92098	11-Mar-2002	iedowse	Fix a bug in ufsdirhash_adjfree() that caused it to incorrectly update the free-space statistics in some cases. The problem affected directory blocks when the free space dropped below the size of the maximum allowed entry size. When this happened, the free-space summary information could claim that there are no further blocks that can fit a maximum-size entry, even if there are. The effect of this bug is that the directory may be enlarged even though there is space within the directory for the new entry. This wastes disk space and has a negative impact on performance. Fix it by correctly computing the dh_firstfree array index, adding a helper macro for clarity. Put an extra sanity check into ufsdirhash_checkblock() to detect the situation in future. Found by: dwmalone Reviewed by: dwmalone MFC after: 1 week
92095	11-Mar-2002	phk	I missed one VOP_CLOSE in the previous commit. Pointed out by: bde
92092	11-Mar-2002	phk	As a XXX bandaid open the mounted device READ/WRITE even if we only mount read-only. The trouble here is that we don't reopen the device in read/write mode when we remount in read/write mode resulting in a filesystem sending write requests to a device which was only opened read/only. I'm not quite sure how such a reopen would best be done and defer the problem to more agile hackers.
91825	07-Mar-2002	rwatson	Update DBA for NAI. We have several. We used the wrong one. :-)
91814	07-Mar-2002	green	Add new errno ``ENOATTR''.
91720	06-Mar-2002	dillon	cleanup readability syntax prior to ongoing b_resid work commits. MFC after: 1 day
91420	27-Feb-2002	jhb	Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic and isn't strictly required. However, it lowers the number of false positives found when grep'ing the kernel sources for p_ucred to ensure proper locking.
91406	27-Feb-2002	jhb	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
91060	22-Feb-2002	phk	Replace bowrite() with BUF_WRITE in ufs. Remove bowrite(), it is now unused. This is the first step in getting entirely rid of BIO_ORDERED which is a generally accepted evil thing. Approved by: mckusick
90972	20-Feb-2002	rwatson	o Minor style fix on #endif, missing '_' in comment.
90860	18-Feb-2002	phk	Make v_addpollinfo() visible and non-inline. Have callers only call it as needed. Add necessary call in ufs_kqfilter(). Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>
90791	17-Feb-2002	phk	Move the stuff related to select and poll out of struct vnode. The use of the zone allocator may or may not be overkill. There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need to revisit. This shaves about 60 bytes of struct vnode which on my laptop means 600k less RAM used for vnodes.
90790	17-Feb-2002	phk	Collect the VN_KNOTE() macro definitions on vnode.h
90538	11-Feb-2002	julian	In a threaded world, differnt priorirites become properties of different entities. Make it so. Reviewed by: jhb@freebsd.org (john baldwin)
90453	10-Feb-2002	rwatson	Minor style tweaks. Remove an unneeded comment and commented out code that won't be needed.
90452	10-Feb-2002	rwatson	Copyright + license update.
90448	10-Feb-2002	rwatson	Part I: Update extended attribute API and ABI: o Modify the system call syntax for extattr_{get,set}_{fd,file}() so as not to use the scatter gather API (which appeared not to be used by any consumers, and be less portable), rather, accepts 'data' and 'nbytes' in the style of other simple read/write interfaces. This changes the API and ABI. o Modify system call semantics so that extattr_get_{fd,file}() return a size_t. When performing a read, the number of bytes read will be returned, unless the data pointer is NULL, in which case the number of bytes of data are returned. This changes the API only. o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t argument so as to return the size, if desirable. If set to NULL, the size will not be returned. o Update various filesystems (pseodofs, ufs) to DTRT. These changes should make extended attributes more useful and more portable. More commits to rebuild the system call files, as well as update userland utilities to follow. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
90438	10-Feb-2002	phk	Remove di_inumber since LFS is long gone.
90366	07-Feb-2002	mckusick	Occationally background fsck would cause a spurious ``freeing free inode'' panic. This change corrects that problem by setting the fs_active flag when the inode map changes to notify the snapshot code that the cylinder group must be rescanned. Submitted by: Robert Watson <rwatson@FreeBSD.org>
90329	07-Feb-2002	mckusick	Occationally deleted files would hang around for hours or days without being reclaimed. This bug was introduced in revision 1.95 dealing with filenames placed in newly allocated directory blocks, thus is not present in 4.X systems. The bug is triggered when a new entry is made in a directory after the data block containing the original new entry has been written, but before the inode that references the data block has been written. Submitted by: Bill Fenner <fenner@research.att.com>
90098	02-Feb-2002	mckusick	When taking a snapshot, we must check for active files that have been unlinked (e.g., with a zero link count). We have to expunge all trace of these files from the snapshot so that they are neither reclaimed prematurely by fsck nor saved unnecessarily by dump.
89680	23-Jan-2002	mckusick	Add a stub for softdep_request_cleanup() so that compilation without SOFTUPDATES option works properly. Submitted by: Benno Rice <benno@jeamland.net>
89637	22-Jan-2002	mckusick	This patch fixes a long standing complaint with soft updates in which small and/or nearly full filesystems would fail with `file system full' messages when trying to replace a number of existing files (for example during a system installation). When the allocation routines are about to fail with a file system full condition, they make a call to softdep_request_cleanup() which attempts to accelerate the flushing of pending deletion requests in an effort to free up space. In the face of filesystem I/O requests that exceed the available disk transfer capacity, the cleanup request could take an unbounded amount of time. Thus, the softdep_request_cleanup() routine will only try for tickdelay seconds (default 2 seconds) before giving up and returning a filesystem full error. Under typical conditions, the softdep_request_cleanup() routine is able to free up space in under fifty milliseconds.
89450	17-Jan-2002	mckusick	Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26 which caused incomplete snapshots to be taken. When background fsck would run on these snapshots, the result would be files being incorrectly released which would subsequently panic the kernel with ``handle_workitem_freefile: inodedep survived'', ``handle_written_inodeblock: live inodedep'', and ``handle_workitem_remove: lost inodedep'' errors.
89413	16-Jan-2002	mckusick	Put write on read-only filesystem panic after we have weeded out block and character devices, fifo's, etc. Submitted by: Bruce Evans <bde@zeta.org.au>
89384	15-Jan-2002	mckusick	When downgrading a filesystem from read-write to read-only, operations involving file removal or file update were not always being fully committed to disk. The result was lost files or corrupted file data. This change ensures that the filesystem is properly synced to disk before the filesystem is down-graded. This delta also fixes a long standing bug in which a file open for reading has been unlinked. When the last open reference to the file is closed, the inode is reclaimed by the filesystem. Previously, if the filesystem had been down-graded to read-only, the inode could not be reclaimed, and thus was lost and had to be later recovered by fsck. With this change, such files are found at the time of the down-grade. Normally they will result in the filesystem down-grade failing with `device busy'. If a forcible down-grade is done, then the affected files will be revoked causing the inode to be released and the open file descriptors to begin failing on attempts to read. Submitted by: "Sam Leffler" <sam@errno.com>
89306	13-Jan-2002	alfred	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file * fhold(struct file fp); / increments reference count on a file / struct file fhold_locked(struct file fp); / like fhold but expects file to locked / struct file ffind_hold(struct thread , int fd); / finds the struct file in thread, adds one reference and returns it unlocked / struct file ffind_lock(struct thread , int fd); / ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.
89295	12-Jan-2002	mckusick	When going to sleep, we must save our SPL so that it does not get lost if some other process uses the lock while we are sleeping. We restore it after we have slept. This functionality is provided by a new routine interlocked_sleep() that wraps the interlocking with functions that sleep. This function is then used in place of the old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros. Submitted by: Debbie Chu <dchu@juniper.net>
89270	11-Jan-2002	mckusick	Must call drain_output() before checking the dirty block list in softdep_sync_metadata(). Otherwise we may miss dependencies that need to be flushed which will result in a later panic with the message ``vinvalbuf: dirty bufs''. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> MFC after: 1 week
89213	10-Jan-2002	phk	Do not pull quota entries of the cache-list if they have already been removed from the cache-list as part of a previous unmount. This would result in panics (page fault in dqflush()) during subsequent umounts provided that enough distinct UID's to actually make the hash do something are active. This can probably explain a number of weird quota related behaviours. PR: 32331 maybe more. Reproduced by: Søren Schrørder <sch@cybercity.dk>
89089	08-Jan-2002	msmith	Initialise the bioops vector hack at runtime rather than at link time. This avoids the use of common variables. Reviewed by: mckusick
88318	20-Dec-2001	dillon	Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout. Hopefully MFC: before the 4.5 release
88138	18-Dec-2001	mckusick	Change the atomic_set_char to atomic_set_int and atomic_clear_char to atomic_clear_int to ease the implementation for the sparc64. Requested by: Jake Burkholder <jake@locore.ca>
88026	16-Dec-2001	iedowse	Make sure we ignore the value of `fs_active' when reloading the superblock, and move the initialisation of it to beside where other pointer fields are initialised.
88025	16-Dec-2001	iedowse	Move the new superblock field `fs_active' into the region of the superblock that is already set up to handle pointer types. This fixes an accidental change in the superblock size on 64-bit platforms caused by revision 1.24.
87827	14-Dec-2001	mckusick	Minimize the time necessary to suspend operations on a filesystem when taking a snapshot. The two time consuming operations are scanning all the filesystem bitmaps to determine which blocks are in use and scanning all the other snapshots so as to be able to expunge their blocks from the view of the current snapshot. The bitmap scanning is broken into two passes. Before suspending the filesystem all bitmaps are scanned. After the suspension, those bitmaps that changed after being scanned the first time are rescanned. Typically there are few bitmaps that need to be rescanned. The expunging of other snapshots is now done after the suspension is released by observing that we can easily identify any blocks that were allocated to them after the suspension (they will be maked as `not needing to be copied' in the just created snapshot). For all the gory details, see the ``Running fsck in the Background'' paper in the Usenix BSDCon 2002 Conference Proceedings, pages 55-64.
87782	13-Dec-2001	mckusick	When a file is partially truncated, we first check to see if the new file end will land in the middle of a file hole. Since the last block of a file must always be allocated, the hole is filled by allocating a block at that location. If the hole being filled is a direct block, then the truncation may eventually reduce the full sized block down to a fragment. When running with soft updates, it is necessary to FSYNC the file after allocating the block and before creating the fragment to avoid triggering a soft updates inconsistency when the block unexpectedly shrinks. Found by: Matthew Dillon <dillon@apollo.backplane.com> MFC after: 1 week
87133	30-Nov-2001	rwatson	Use 'mkdir -p /.attribute/system' instead of breaking it into two seperate mkdir targets. Submitted by: jedgar
87132	30-Nov-2001	rwatson	Use 'mkdir -p /.attribute/system' instead of breaking it into two seperate mkdir targets.
87131	30-Nov-2001	rwatson	README.extattr incorrectly specified sample command lines for UFS_EXTATTR_AUTOSTART. Insert the missing 'initattr' arguments to extattrctl. Noticed by: green
86782	22-Nov-2001	guido	When mkdir()-ing, the parent dir gets is linkcount increased. Fix VN_KNOTE to reflect that. Found by: tobez@freebsd.org MFC after: 2 days
86350	14-Nov-2001	iedowse	Oops, when trying the dirhash sequential-access optimisation, compare the slot offset against the predicted offset, not a boolean flag. This typo effectively disabled the sequential optimisation, but was otherwise harmless. Not surprisingly, fixing this improves performance in the sequential access case. I am seeing a 7% speedup on one machine here; using dirhash when sequentially looking up directory entries is now about 5% faster instead of 2% slower than the non-dirhash case. Submitted by: KOIE Hidetaka <koie@suri.co.jp> MFC after: 1 week
86089	05-Nov-2001	dillon	Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking in wdrain during a write. This flag needs to be used in devices whos strategy routines turn-around and issue another high level I/O, such as when MD turns around and issues a VOP_WRITE to vnode backing store, in order to avoid deadlocking the dirty buffer draining code. Remove a vprintf() warning from MD when the backing vnode is found to be in-use. The syncer of buf_daemon could be flushing the backing vnode at the time of an MD operation so the warning is not correct. MFC after: 1 week
85845	01-Nov-2001	rwatson	o Update copyright dates. o Add reference to TrustedBSD Project in license header. o Update dated comments, including comment in extattr.h claiming that no file systems support extended attributes. o Improve comment consistency.
85581	27-Oct-2001	rwatson	o Althought this is not specified in POSIX.1e, the UFS ACL implementation coerces the deletion of a default ACL on a directory when no default ACL EA is present to success. Because the UFS EA implementation doesn't disinguish the EA failure modes "that EA name has not been administratively enabled" from "that EA name has no defined data", there's a potential conflict in error return values. Normally, the lack of administratively configured EA support is coerced to EOPNOTSUPP to indicate that ACLs are not available; in this case, it is possible to get a successful return, even if ACLs are not available because EA support for them has not been enabled. Expand the comment in ufs_setacl() to identify this case. Obtained from: TrustedBSD Project
85580	27-Oct-2001	rwatson	o Clarify a comment about the locking condition of the vnode upon exit from ufs_extattr_enable_with_open(). o Print auto-start notifications if (bootverbose). This was previously commented out since it didn't know how to check for bootverbose. o Drop in comments throughout indicating where ENOENT should be replaced with ENOATTR once that is available. Obtained from: TrustedBSD Project
85579	27-Oct-2001	rwatson	o The comment about ordering the destruction of the lock and the removal of the flag indicating that the structure was initialized didn't need an XXX, since it didn't need fixing. Obtained from: TrustedBSD Project
85578	27-Oct-2001	rwatson	o Wrap a number of long lines of code, many of which were introduced due to KSE-related (p) expansions. Obtained from: TrustedBSD Project
85577	27-Oct-2001	rwatson	Since namespace support was added to the UFS extended attribute implementation to replace single-character namespace prefixes, '$' is no longer an invalid attribute name, and the namespace is relevant to validity determination. o Remove '$' case from ufs_extattr_valid_attrname() o Add attrnamespace argument to ufs_extattr_valid_attrname(), and fill out appropriately. Currently no decisions are made based on the namespace argument, but may be in the future. Obtained from: TrustedBSD Project
85517	26-Oct-2001	dillon	Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a real effect. Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%. Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%. (more optimization work is needed on top of these fixes) MFC after: 1 week
85512	25-Oct-2001	iedowse	Default to not performing ufs_dirhash's extensive directory-block sanity check after every directory modification. This check can be re-enabled at any time by setting the sysctl "vfs.ufs.dirhash_docheck" to 1. This group of sanity tests was there to ensure that any UFS_DIRHASH bugs could be caught by a panic before a potentially corrupted directory block would be written to disk. It has served its main purpose now, so disable it in the interest of performance. MFC after: 1 week
85339	23-Oct-2001	dillon	Change the vnode list under the mount point from a LIST to a TAILQ in preparation for an implementation of limiting code for kern.maxvnodes. MFC after: 3 days
84827	11-Oct-2001	jhb	Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.
84811	11-Oct-2001	jhb	Add missing includes of sys/lock.h.
84642	08-Oct-2001	dillon	Remove panics for rename() race conditions. The panics are inappropriate because the IN_RENAME flag only fixes a few of the huge number of race conditions that can result in the source path becoming invalid even prior to the VOP_RENAME() call. The panics created a serious security issue whereby an attacker could fairly easily cause the panic to occur, crashing the machine. The correct solution requires a great deal of work in the namei path cache code. MFC after: 0 days
84374	02-Oct-2001	rwatson	o Replace two direct uid!=0 comparisons with suser_xxx() calls. Obtained from: TrustedBSD Project
84373	02-Oct-2001	rwatson	o Replace two direct uid!=0 comparisons with suser_td() calls. Obtained from: TrustedBSD Project
84344	02-Oct-2001	dillon	Backout the last commit. The problem is actually much worse then I first thought and may require serious work to the VOP_RENAME() api itself. Basically, by the time the VOP_RENAME() function is called, it's already too late.
84339	02-Oct-2001	dillon	IN_RENAME should only be cleared by the routine that set it. This fixes a rename/rmdir race that has been shown to cause a panic. Bug reported by: Yevgeniy Aleynikov <eugenea@infospace.com> MFC after: 3 days
84050	27-Sep-2001	jhb	- Fix some minor whitespace nits. - Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a little nit that caused it to be defined as -(sizeof (struct thread) + 1) instead of -2.
83992	26-Sep-2001	rwatson	o Re-enable support of system file flags in jail() by adding back the PRISON_ROOT to the suser_xxx() check. Since securelevels may now be raised in specific jails, use of system flags can still be restricted in jail(), but in a more configurable way. o Users of jail() expecting system flags (such as schg) to restrict jail()'s should be sure to set the securelevel appropriately in jail()'s. o This fixes activities involving automated system flag removal in jail(), including installkernel and friends. Obtained from: TrustedBSD Project
83987	26-Sep-2001	rwatson	o Modify ufs_setattr() so that it uses securelevel_gt() instead of direct variable access. Obtained from: TrustedBSD Project
83924	25-Sep-2001	rwatson	o Further clarify comment: ad Udo's request, re-insert the 'if' refering to securelevels; also, update the unprivileged process text to better indicate the scope of actions permittable when any system flags are already set (limited). Submitted by: Udo Schweigert <udo.schweigert@siemens.com>
83918	25-Sep-2001	rwatson	o Parallelize the comment on the relationship between privileged un-jailed processes and the actual securelevel check: make the comment use '> 0' instead of inverted '<= 0'.
83899	24-Sep-2001	iedowse	The addition of i_dirhash to struct inode pushed RELENG_4's sizeof(struct inode) into a new malloc bucket on the i386. This didn't happen in -current due to the removal of i_lock, but it does no harm to apply the workaround to -current first. Reduce the size of the i_spare[] array in struct inode from 4 to 3 entries, and change ext2fs to use i_din.di_spare[1] so that it does not need i_spare[3]. Reviewed by: bde MFC after: 3 days
83366	12-Sep-2001	julian	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
83263	09-Sep-2001	iedowse	The "dirpref" directory layout preference improvements make use of an array "fs_contigdirs[]" to avoid too many directories getting created in each cylinder group. The memory required for this and two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with a single malloc() call, and divided up afterwards. However, the 'space' pointer is not advanced correctly, so fs_contigdirs and fs_maxcluster end up pointing to the same address. Add the missing code to advance the 'space' pointer, and remove an unnecessary update of the pointer that follows. This is likely to fix the "ffs_clusteralloc: map mismatch" panics that have been reported recently. Submitted by: Luke Mewburn <lukem@wasabisystems.com>
82770	01-Sep-2001	jedgar	Use ACL_PERM_NONE instead of hardcoding 0 when initializing ACL entry permissions. Reviewed by: rwatson
82755	01-Sep-2001	rwatson	o At some point, unmounting a non-EA file system with EA's compiled in got a bit broken, when ufs_extattr_stop() was called and failed, ufs_extattr_destroy() would panic. This makes the call to destroy() conditional on the success of stop(). Submitted by: Christian Carstensen <cc@devcon.net> Obtained from: TrustedBSD Project
82395	27-Aug-2001	peter	If a file has been completely unlinked, stop automatically syncing the file. ffs will discard any pending dirty pages when it is closed, so we may as well not waste time trying to clean them. This doesn't stop other things from writing it out, eg: pageout, fsync(2) etc.
82364	26-Aug-2001	iedowse	Stop using dirhash when a directory is removed, and ensure that we never attempt to hash directories once they are deleted. This fixes a problem where operations on a deleted directory could trigger dirhash sanity panics.
82334	26-Aug-2001	iedowse	When compacting directories, ufs_direnter() always trusted DIRSIZ() to supply the number of bytes to be bcopy()'d to move an entry. If d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible length, so ufs_direnter could end up corrupting a directory during compaction. In practice I believe this can only happen after fsck_ffs has fixed a previously-corrupted directory. We now deal with any mid-block unused entries specially to avoid using DIRSIZ() or bcopy() on such entries. We also ensure that the variables 'dsize' and 'spacefree' contain meaningful values at all times. Add a few comments to describe better this intricate piece of code. The special handling of mid-block unused entries makes the dirhash- specific bugfix in the previous revision (1.53) now uncecessary, so this change removes it. Reviewed by: mckusick
82124	22-Aug-2001	iedowse	When compressing directory blocks, the dirhash code didn't check that the directory entry was in use before attempting to find it in the hash structures to change its offset. Normally, unused entries do not need to be moved, but fsck can leave behind some unused entries that do. A dirhash sanity panic resulted when the entry to be moved was not found. Add a check that stops entries with d_ino == 0 from being passed to ufsdirhash_move().
81877	18-Aug-2001	peter	Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS' without 'options FFS' would fail to link.
80554	29-Jul-2001	iedowse	Two recent commits in sys/ufs/ufs interacted badly with ext2fs because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink is invalid because ext2fs does not maintain this field. In ufs_close(), i_effnlink is also tested, to determines whether or not to call vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of ext2fs filesystems; I believe the other is harmless. Fix both cases by checking um_i_effnlink_valid in the ufsmount struct, and use i_nlink if necessary. Noticed by: bde Reviewed by: mckusick, bde
80456	27-Jul-2001	iedowse	Disable the dirhash sanity check that panics if an unused directory entry (d_ino == 0) is found in a position that is not the start of a DIRBLKSIZ block. While such entries cannot occur normally (ufs always extends the previous entry to cover the free space instead), they do not cause problems and fsck does not fix them, so panicking is bad.
79769	16-Jul-2001	peter	Use a fixed type for times in on-disk structures for ufs rather than something that could potentially change like time_t.
79690	13-Jul-2001	iedowse	Return a locked struct buf from ufsdirhash_lookup() to avoid one extra getblk/brelse sequence for each lookup. We already had this buf in ufsdirhash_lookup(), so there was no point in brelse'ing it only to have the caller immediately reaquire the same buffer. This should make the case of sequential lookups marginally faster; in my tests, sequential lookups with dirhash enabled are now only around 1% slower than without dirhash.
79561	10-Jul-2001	iedowse	Bring in dirhash, a simple hash-based lookup optimisation for large directories. When enabled via "options UFS_DIRHASH", in-core hash arrays are maintained for large directories. These allow all directory operations to take place quickly instead of requiring long linear searches. For now anyway, dirhash is not enabled by default. The in-core hash arrays have a memory requirement that is approximately half the size of the size of the on-disk directory file. A number of new sysctl variables allow control over which directories get hashed and over the maximum amount of memory that dirhash will use: vfs.ufs.dirhash_minsize The minimum on-disk directory size for which hashing should be used. The default is 2560 (2.5k). vfs.ufs.dirhash_maxmem The system-wide maximum total memory to be used by dirhash data structures. The default is 2097152 (2MB). The current amount of memory being used by dirhash is visible through the read-only sysctl variable vfs.ufs.dirhash_maxmem. Finally, some extra sanity checks that are enabled by default, but which may have an impact on performance, can be disabled by setting vfs.ufs.dirhash_docheck to 0. Discussed on: -fs, -hackers
79224	04-Jul-2001	dillon	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
78940	28-Jun-2001	jhb	Fix more mntvnode and vnode interlock order reversals.
78912	28-Jun-2001	jhb	- Fix a mntvnode and vnode interlock reversal. - Protect the mnt_vnode list with the mntvnode lock. - Use queue(9) macros.
78256	15-Jun-2001	peter	Fix warning: 1973: warning: int format, long int arg (arg 5)
78191	13-Jun-2001	mckusick	Build on the change in revision 1.98 by Tor.Egge@fast.no. The symptom being treated in 1.98 was to avoid freeing a pagedep dependency if there was still a newdirblk dependency referencing it. That change is correct and no longer prints a warning message when it occurs. The other part of revision 1.98 was to panic when a newdirblk dependency was encountered during a file truncation. This fix removes that panic and replaces it with code to find and delete the newdirblk dependency so that the truncation can succeed.
77847	07-Jun-2001	tmm	Call vn_close on the backing file vnode if ufs_extattr_enable failed to avoid leaking it. Reviewed by: rwatson
77822	06-Jun-2001	jlemon	Add a wrapper for the fifo kqfilter which falls through to the ufs routine. This permits the fifo to inherit the ufs VNODE kqfilter.
77762	05-Jun-2001	jlemon	Add a kqueue filter for writing to ufs filesystems which always returns true. This permits better interoperability with programs which register filters on their stdin/stdout handles. Submitted by: Niels Provos <provos@citi.umich.edu>
77743	05-Jun-2001	obrien	There seems to be a problem that the order of disk write operation being incorrect due to a missing check for some dependency. This change avoids the freelist corruption (but not the temporarily inconsistent state of the file system). A message is printed as a reminder of the under lying problem when a pagedep structure is not freed due to the NEWBLOCK flag being set. Submitted by: Tor.Egge@fast.no
77509	30-May-2001	jhb	Revert the previous commit in favor of the fix in rev 1.42 of ufs/ffs/ffs_extern.h instead. Requested by: bde
77508	30-May-2001	jhb	Forward declare struct cg to quiet a warning. Submitted by: bde
77445	29-May-2001	jhb	Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a warning.
77437	29-May-2001	phk	Remove last vestiges of MFS.
77417	29-May-2001	phk	Remove MFS from the kernel.
77190	25-May-2001	tmm	Add a check to determine whether extended attributes have been initialized on the file system before trying to grab the lock of the per-mount extattr structure, as this lock is unitialized in that case. This is needed because ufs_extattr_vnode_inactive is called from ufs_inactive, which is also used by EA-unaware file systems such as ext2fs. Reviewed by: rwatson
77183	25-May-2001	rwatson	o Merge contents of struct pcred into struct ucred. Specifically, add the real uid, saved uid, real gid, and saved gid to ucred, as well as the pcred->pc_uidinfo, which was associated with the real uid, only rename it to cr_ruidinfo so as not to conflict with cr_uidinfo, which corresponds to the effective uid. o Remove p_cred from struct proc; add p_ucred to struct proc, replacing original macro that pointed. p->p_ucred to p->p_cred->pc_ucred. o Universally update code so that it makes use of ucred instead of pcred, p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo, cr_{r,sv}{u,g}id instead of p_*, etc. o Remove pcred0 and its initialization from init_main.c; initialize cr_ruidinfo there. o Restruction many credential modification chunks to always crdup while we figure out locking and optimizations; generally speaking, this means moving to a structure like this: newcred = crdup(oldcred); ... p->p_ucred = newcred; crfree(oldcred); It's not race-free, but better than nothing. There are also races in sys_process.c, all inter-process authorization, fork, exec, and exit. o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid; remove comments indicating that the old arrangement was a problem. o Restructure exec1() a little to use newcred/oldcred arrangement, and use improved uid management primitives. o Clean up exit1() so as to do less work in credential cleanup due to pcred removal. o Clean up fork1() so as to do less work in credential cleanup and allocation. o Clean up ktrcanset() to take into account changes, and move to using suser_xxx() instead of performing a direct uid==0 comparision. o Improve commenting in various kern_prot.c credential modification calls to better document current behavior. In a couple of places, current behavior is a little questionable and we need to check POSIX.1 to make sure it's "right". More commenting work still remains to be done. o Update credential management calls, such as crfree(), to take into account new ruidinfo reference. o Modify or add the following uid and gid helper routines: change_euid() change_egid() change_ruid() change_rgid() change_svuid() change_svgid() In each case, the call now acts on a credential not a process, and as such no longer requires more complicated process locking/etc. They now assume the caller will do any necessary allocation of an exclusive credential reference. Each is commented to document its reference requirements. o CANSIGIO() is simplified to require only credentials, not processes and pcreds. o Remove lots of (p_pcred==NULL) checks. o Add an XXX to authorization code in nfs_lock.c, since it's questionable, and needs to be considered carefully. o Simplify posix4 authorization code to require only credentials, not processes and pcreds. Note that this authorization, as well as CANSIGIO(), needs to be updated to use the p_cansignal() and p_cansched() centralized authorization routines, as they currently do not take into account some desirable restrictions that are handled by the centralized routines, as well as being inconsistent with other similar authorization instances. o Update libkvm to take these changes into account. Obtained from: TrustedBSD Project Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
77115	24-May-2001	dillon	This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece. Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency. I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet. Submitted by: tegge, dillon
77037	23-May-2001	alfred	ufs_bmaparray() may block on IO, drop vm mutex and aquire Giant when calling it from the pager routine
77031	23-May-2001	ru	- FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file systems were repo-copied from sys/miscfs to sys/fs. - Renamed the following file systems and their modules: fdesc -> fdescfs, portal -> portalfs, union -> unionfs. - Renamed corresponding kernel options: FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS. - Install header files for the above file systems. - Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland Makefiles.
76900	20-May-2001	mckusick	Update softdep_setup_directory_add prototype to reflect changes in actual function. Obtained from: Jim Bloom <bloom@jbloom.jbloom.org>
76859	19-May-2001	mckusick	Must ensure that all the entries on the pd_pendinghd list have been committed to disk before clearing them. More specifically, when free_newdirblk is called, we know that the inode claims the new directory block. However, if the associated pagedep is still linked onto the directory buffer dependency chain, then some of the entries on the pd_pendinghd list may not be committed to disk yet. In this case, we will simply note that the inode claims the block and let the pd_pendinghd list be processed when the pagedep is next written. If the pagedep is no longer on the buffer dependency chain, then all the entries on the pd_pending list are committed to disk and we can free them in free_newdirblk. This corrects a window of vulnerability introduced in the code added in version 1.95.
76827	19-May-2001	alfred	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb
76825	18-May-2001	mckusick	Must be a bit less aggressive about freeing pagedep structures. Obtained from: Robert Watson <rwatson@FreeBSD.org> and Matthew Jacob <mjacob@feral.com>
76724	17-May-2001	mckusick	When a new block is allocated to a directory, an fsync of a file whose name is within that block must ensure not only that the block containing the file name has been written, but also that the on-disk directory inode references that block. When a new directory block is created, we allocate a newdirblk structure which is linked to the associated allocdirect (on its ad_newdirblk list). When the allocdirect has been satisfied, the newdirblk structure is moved to the inodedep id_bufwait list of its directory to await the inode being written. When the inode is written, the directory entries are fully committed and can be deleted from their pagedep->id_pendinghd and inodedep->id_pendinghd lists.
76688	16-May-2001	iedowse	Change the second argument of vflush() to an integer that specifies the number of references on the filesystem root vnode to be both expected and released. Many filesystems hold an extra reference on the filesystem root vnode, which must be accounted for when determining if the filesystem is busy and then released if it isn't busy. The old `skipvp' approach required individual filesystem xxx_unmount functions to re-implement much of vflush()'s logic to deal with the root vnode. All 9 filesystems that hold an extra reference on the root vnode got the logic wrong in the case of forced unmounts, so `umount -f' would always fail if there were any extra root vnode references. Fix this issue centrally in vflush(), now that we can. This commit also fixes a vnode reference leak in devfs, which could result in idle devfs filesystems that refuse to unmount. Reviewed by: phk, bp
76580	14-May-2001	mckusick	Further fixes for deadlock in the presence of multiple snapshots. There are still more to find, but this fix should cover the common cases that folks are hitting.
76557	13-May-2001	mckusick	If the effective link count is zero when an NFS file handle request comes in for it, the file is really gone, so return ESTALE. The problem arises when the last reference to an FFS file is released because soft-updates may delay the actual freeing of the inode for some time. Since there are no filesystem links or open file descriptors referencing the inode, from the point of view of the system, the file is inaccessible. However, if the filesystem is NFS exported, then the remote client can still access the inode via ufs_fhtovp() until the inode really goes away. To prevent this anomoly, it is necessary to begin returning ESTALE at the same time that the file ceases to be accessible to the local filesystem. Obtained from: Ian Dowse <iedowse@maths.tcd.ie>
76458	11-May-2001	mckusick	Remove yet another deadlock case.
76357	08-May-2001	mckusick	When running with soft updates, track the number of blocks and files that are committed to being freed and reflect these blocks in the counts returned by statfs (and thus also by the `df' command). This change allows programs such as those that do news expiration to know when to stop if they are trying to create a certain percentage of free space. Note that this change does not solve the much harder problem of making this to-be-freed space available to applications that want it (thus on a nearly full filesystem, you may still encounter out-of-space conditions even though the free space will show up eventually). Hopefully this harder problem will be the subject of a future enhancement.
76356	08-May-2001	mckusick	Several fixes for units errors: 1) Do not assume that the superblock will be of size fs->fs_bsize. This fixes a panic when taking a snapshot on a filesystem with a block size bigger than 8K. 2) Properly calculate the number of fragments that follow the superblock summary information. This fixes a bug with inconsistent snapshots. 3) When cleaning up a snapshot that is about to be removed, properly calculate the number of blocks that need to be checked. This fixes a bug that created partially allocated inodes. 4) When moving blocks from a snapshot that is about to be removed to another snapshot, properly account for the reduced number of blocks in the snapshot from which they are taken. This fixes a bug in which the number of blocks released from a snapshot did not match the number that it claimed to have.
76354	08-May-2001	mckusick	When syncing out snapshot metadata, we must temporarily allow recursive buffer locking so as to avoid locking against ourselves if we need to write filesystem metadata.
76269	04-May-2001	mckusick	Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce the amount of time that the filesystem must be suspended. The current snapshot is elided as well as the earlier snapshots.
76174	01-May-2001	phk	Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.
76173	01-May-2001	phk	Remove blatantly pointless call to VOP_BMAP(). Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.
76167	01-May-2001	phk	Implement vop_std{get\|put}pages() and add them to the default vop[]. Un-copy&paste all the VOP_{GET\|PUT}PAGES() functions which do nothing but the default.
76166	01-May-2001	markm	Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files. Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h form kernel .c files. Sort sys/*.h includes where possible in affected files. OK'ed by: bde (with reservations)
76132	29-Apr-2001	phk	VOP_BALLOC was never really a VOP in the first place, so convert it to UFS_BALLOC like the other "between UFS and FFS function interfaces".
76131	29-Apr-2001	phk	Add a vop_stdbmap(), and make it part of the default vop vector. Make 7 filesystems which don't really know about VOP_BMAP rely on the default vector, rather than more or less complete local vop_nopbmap() implementations.
76129	29-Apr-2001	phk	Call ufs_bmaparray() directly instead of indirectly via VOP_BMAP().
76128	29-Apr-2001	phk	Remove two unused arguments from ufs_bmaparray().
76127	29-Apr-2001	phk	Remove faint traces of blind copy&paste.
76126	29-Apr-2001	phk	Remove faint traces of non-existant ffs_bmap().
76117	29-Apr-2001	grog	Revert consequences of changes to mount.h, part 2. Requested by: bde
75993	26-Apr-2001	mckusick	Rather than copying all the indirect blocks of the snapshot, simply mark them as BLK_NOCOPY. This trick cuts the initial size of the snapshot in half and cuts the time to take a snapshot by a third.
75943	25-Apr-2001	mckusick	When closing the last reference to an unlinked file, it is freed by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended. For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written. Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.
75934	25-Apr-2001	phk	Move the netexport structure from the fs-specific mountstructure to struct mount. This makes the "struct netexport *" paramter to the vfs_export and vfs_checkexport interface unneeded. Consequently that all non-stacking filesystems can use vfs_stdcheckexp(). At the same time, make it a pointer to a struct netexport in struct mount, so that we can remove the bogus AF_MAX and #include <net/radix.h> from <sys/mount.h>
75892	24-Apr-2001	iedowse	Pre-dirpref versions of fsck may zero out the new superblock fields fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause panics if these fields were zeroed while a filesystem was mounted read-only, and then remounted read-write. Add code to ffs_reload() which copies the fs_contigdirs pointer from the previous superblock, and reinitialises fs_avgf* if necessary. Reviewed by: mckusick
75858	23-Apr-2001	grog	Correct #includes to work with fixed sys/mount.h.
75580	17-Apr-2001	phk	This patch removes the VOP_BWRITE() vector. VOP_BWRITE() was a hack which made it possible for NFS client side to use struct buf with non-bio backing. This patch takes a more general approach and adds a bp->b_op vector where more methods can be added. The success of this patch depends on bp->b_op being initialized all relevant places for some value of "relevant" which is not easy to determine. For now the buffers have grown a b_magic element which will make such issues a tiny bit easier to debug.
75573	17-Apr-2001	mckusick	Add debugging option to always read/write cylinder groups as full sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'. Add debugging option to disable background writes of cylinder groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'. These debugging options should be tried on systems that are panicing with corrupted cylinder group maps to see if it makes the problem go away. The set of panics in question are: ffs_clusteralloc: map mismatch ffs_nodealloccg: map corrupted ffs_nodealloccg: block not in map ffs_alloccg: map corrupted ffs_alloccg: block not in map ffs_alloccgblk: cyl groups corrupted ffs_alloccgblk: can't find blk in cyl ffs_checkblk: partially free fragment The following panics are less likely to be related to this problem, but might be helped by these debugging options: ffs_valloc: dup alloc ffs_blkfree: freeing free block ffs_blkfree: freeing free frag ffs_vfree: freeing free inode If you try these options, please report whether they helped reduce your bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com> and to Matt Dillon <dillon@earth.backplane.com>.
75572	17-Apr-2001	mckusick	Background fsck sysctl operations must use vn_start_write and vn_finished_write so that they do not attempt to modify a suspended filesystem.
75571	17-Apr-2001	rwatson	In my first reading of POSIX.1e, I misinterpreted handling of the ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the access ACL could be used by privileged processes to change file/directory ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and ACL_OTHER) should have undefined ae_id fields; this commit attempts to correct that misunderstanding. o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid associated with the vnode, as those can no longer be extracted from the ACL passed as an argument. Perform all comparisons against the passed arguments. This actually has the effect of simplifying a number of components of this call, as well as reducing the indent level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP. o Modify acl_posix1e_check() to return EINVAL if the ae_id field of any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value other than ACL_UNDEFINED_ID. As a temporary work-around to allow clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before each check so that this cannot cause a failure in the short term (this work-around will be removed when the userland libraries and utilities are updated to take this change into account). o Modify ufs_sync_acl_from_inode() so that it forces ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID when synchronizing the ACL from the inode. o Modify ufs_sync_inode_from_acl to not propagate uid and gid information to the inode from the ACL during ACL update. Also modify the masking of permission bits that may be set from ALLPERMS to (S_IRWXU\|S_IRWXG\|S_IRWXO), as ACLs currently do not carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT). o Modify ufs_getacl() so that when it emulates an access ACL from the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID. o Clean up ufs_setacl() substantially since it is no longer possible to perform chown/chgrp operations using vop_setacl(), so all the access control for that can be eliminated. o Modify ufs_access() so that it passes owner uid and gid information into vaccess_acl_posix1e(). Pointed out by: jedger Obtained from: TrustedBSD Project
75515	14-Apr-2001	mckusick	Update to describe use of mdconfig instead of deprecated vnconfig. Submitted by: Steve Ames <steve@virtual-voodoo.com>
75503	14-Apr-2001	mckusick	This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag. It is described in ufs/ffs/fs.h as follows: /* * Filesystem flags. * * Note that the FS_NEEDSFSCK flag is set and cleared only by the * fsck utility. It is set when background fsck finds an unexpected * inconsistency which requires a traditional foreground fsck to be * run. Such inconsistencies should only be found after an uncorrectable * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when * it has successfully cleaned up the filesystem. The kernel uses this * flag to enforce that inconsistent filesystems be mounted read-only. / #define FS_UNCLEAN 0x01 / filesystem not clean at mount / #define FS_DOSOFTDEP 0x02 / filesystem using soft dependencies / #define FS_NEEDSFSCK 0x04 / filesystem needs sync fsck before mount */
75377	10-Apr-2001	mckusick	Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>. His description of the problem and solution follow. My own tests show speedups on typical filesystem intensive workloads of 5% to 12% which is very impressive considering the small amount of code change involved. ------ One day I noticed that some file operations run much faster on small file systems then on big ones. I've looked at the ffs algorithms, thought about them, and redesigned the dirpref algorithm. First I want to describe the results of my tests. These results are old and I have improved the algorithm after these tests were done. Nevertheless they show how big the perfomance speedup may be. I have done two file/directory intensive tests on a two OpenBSD systems with old and new dirpref algorithm. The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports". The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release. It contains 6596 directories and 13868 files. The test systems are: 1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for test is at wd1. Size of test file system is 8 Gb, number of cg=991, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=35 2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system at wd0, file system for test is at wd1. Size of test file system is 40 Gb, number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50 You can get more info about the test systems and methods at: http://www.ptci.ru/gluk/dirpref/old/dirpref.html Test Results tar -xzf ports.tar.gz rm -rf ports mode old dirpref new dirpref speedup old dirprefnew dirpref speedup First system normal 667 472 1.41 477 331 1.44 async 285 144 1.98 130 14 9.29 sync 768 616 1.25 477 334 1.43 softdep 413 252 1.64 241 38 6.34 Second system normal 329 81 4.06 263.5 93.5 2.81 async 302 25.7 11.75 112 2.26 49.56 sync 281 57.0 4.93 263 90.5 2.9 softdep 341 40.6 8.4 284 4.76 59.66 "old dirpref" and "new dirpref" columns give a test time in seconds. speedup - speed increasement in times, ie. old dirpref / new dirpref. ------ Algorithm description The old dirpref algorithm is described in comments: /* * Find a cylinder to place a directory. * * The policy implemented by this algorithm is to select from * among those cylinder groups with above the average number of * free inodes, the one with the smallest number of directories. / A new directory is allocated in a different cylinder groups than its parent directory resulting in a directory tree that is spreaded across all the cylinder groups. This spreading out results in a non-optimal access to the directories and files. When we have a small filesystem it is not a problem but when the filesystem is big then perfomance degradation becomes very apparent. What I mean by a big file system ? 1. A big filesystem is a filesystem which occupy 20-30 or more percent of total drive space, i.e. first and last cylinder are physically located relatively far from each other. 2. It has a relatively large number of cylinder groups, for example more cylinder groups than 50% of the buffers in the buffer cache. The first results in long access times, while the second results in many buffers being used by metadata operations. Such operations use cylinder group blocks and on-disk inode blocks. The cylinder group block (fs->fs_cblkno) contains struct cg, inode and block bit maps. It is 2k in size for the default filesystem parameters. If new and parent directories are located in different cylinder groups then the system performs more input/output operations and uses more buffers. On filesystems with many cylinder groups, lots of cache buffers are used for metadata operations. My solution for this problem is very simple. I allocate many directories in one cylinder group. I also do some things, so that the new allocation method does not cause excessive fragmentation and all directory inodes will not be located at a location far from its file's inodes and data. The algorithm is: / * Find a cylinder group to place a directory. * * The policy implemented by this algorithm is to allocate a * directory inode in the same cylinder group as its parent * directory, but also to reserve space for its files inodes * and data. Restrict the number of directories which may be * allocated one after another in the same cylinder group * without intervening allocation of files. * * If we allocate a first level directory then force allocation * in another cylinder group. / My early versions of dirpref give me a good results for a wide range of file operations and different filesystem capacities except one case: those applications that create their entire directory structure first and only later fill this structure with files. My solution for such and similar cases is to limit a number of directories which may be created one after another in the same cylinder group without intervening file creations. For this purpose, I allocate an array of counters at mount time. This array is linked to the superblock fs->fs_contigdirs[cg]. Each time a directory is created the counter increases and each time a file is created the counter decreases. A 60Gb filesystem with 8mb/cg requires 10kb of memory for the counters array. The maxcontigdirs is a maximum number of directories which may be created without an intervening file creation. I found in my tests that the best performance occurs when I restrict the number of directories in one cylinder group such that all its files may be located in the same cylinder group. There may be some deterioration in performance if all the file inodes are in the same cylinder group as its containing directory, but their data partially resides in a different cylinder group. The maxcontigdirs value is calculated to try to prevent this condition. Since there is no way to know how many files and directories will be allocated later I added two optimization parameters in superblock/tunefs. They are: int32_t fs_avgfilesize; / expected average file size / int32_t fs_avgfpdir; / expected # of files per directory */ These parameters have reasonable defaults but may be tweeked for special uses of a filesystem. They are only necessary in rare cases like better tuning a filesystem being used to store a squid cache. I have been using this algorithm for about 3 months. I have done a lot of testing on filesystems with different capacities, average filesize, average number of files per directory, and so on. I think this algorithm has no negative impact on filesystem perfomance. It works better than the default one in all cases. The new dirpref will greatly improve untarring/removing/coping of big directories, decrease load on cvs servers and much more. The new dirpref doesn't speedup a compilation process, but also doesn't slow it down. Obtained from: Grigoriy Orlov <gluk@ptci.ru>
75138	03-Apr-2001	rwatson	o Indent sub-section headings to be consistent with README.extattr. Obtained from: TrustedBSD Project
75134	03-Apr-2001	rwatson	o Introduce a README file describing briefly how to use access control lists, in the style of FFS README files for soft updates and snapshots. Obtained from: TrustedBSD Project
75133	03-Apr-2001	rwatson	o Introduce a README file describing briefly how to use extended attributes, in the style of FFS README files for soft updates and snapshots. Obtained from: TrustedBSD Project
75106	03-Apr-2001	rwatson	o Change the default from using IO_SYNC on EA set and delete operations to not using IO_SYNC. Expose a sysctl (debug.ufs_extattr_sync) for enabling the use of IO_SYNC. - Use of IO_SYNC substantially degrades ACL performance when a default ACL is set on a directory, as there are four synchronous writes initiated to define both supporting EAs for new sub-directories, and to set the data; two for new files. Later, this may be optimized to two writes for sub-directories, one for new files. - IO_SYNC does not substantially improve consistency properties due to the poor consistency properties of existing permissions (which ACLs are a superset of), due to interaction with soft updates, and due to differences in handling consistency for data and file system meta-data. - In macro-benchmarks, this reduces the overhead of setting default ACLs down to the same overhead as enabling ACLs on a file system and not using them. Enabling ACLs still introduces a small overhead (I measure 7% on a -j 2 buildworld with pre-allocated EA backing store, but this is not rigorous testing, nor in any way optimized). - The sysctl will probably change to another administration method (or at least, a better name) in the near future, but consistency properties of EAs are still being worked out. The toggle is defined right now to allow easier performance analysis and exploration of possible guarantees. Obtained from: TrustedBSD Project
75077	02-Apr-2001	rwatson	o Correct an ACL implementation bug that could result in a system panic under heavy use when default ACLs were bgin inherited by new files or directories. This is done by removing a bug in default ACL reading, and improving error handling for this failure case: - Move the setting of the buffer length (len) variable to above the ACL type (ap->a_type) switch rather than having it only for ACL_TYPE_ACCESS. Otherwise, the len variable is unitialized in the ACL_TYPE_DEFAULT case, which generally worked right, but could result in failure. - Add a check for a short/long read of the ACL_TYPE_DEFAULT type from the underlying EA, resulting in EPERM rather than passing a potentially corrupted ACL back to the caller (resulting "cleaner" failures if the EA is damaged: right now, the caller will almost always panic in the presence of a corrupted EA). This code is similar to code in the ACL_TYPE_ACCESS handling in the previous switch case. - While I'm fixing this code, remove a redundant bzero() of the ACL reader buffer; it need only be initialized above the acl_type switch. Obtained from: TrustedBSD Project
74822	26-Mar-2001	rwatson	Introduce support for POSIX.1e ACLs on UFS-based file systems. This implementation is still experimental, and while fairly broadly tested, is not yet intended for production use. Support for POSIX.1e ACLs on UFS will not be MFC'd to RELENG_4. This implementation works by providing implementations of VOP_[GS]ETACL() for FFS, as well as modifying the appropriate access control and file creation routines. In this implementation, ACLs are backed into extended attributes; the base ACL (owner, group, other) permissions remain in the inode for performance and compatibility reasons, so only the extended and default ACLs are placed in extended attributes. The logic for ACL evaluation is provided by the fs-independent kern/kern_acl.c. o Introduce UFS_ACL, a compile-time configuration option that enables support for ACLs on FFS (and potentially other UFS-based file systems). o Introduce ufs_getacl(), ufs_setacl(), ufs_aclcheck(), which respectively get, set, and check the ACLs on the passed vnode. o Introduce ufs_sync_acl_from_inode(), ufs_sync_inode_from_acl() to maintain access control information between inode permissions and extended attribute data. o Modify ufs_access() to load a file access ACL and invoke vaccess_acl_posix1e() if ACLs are available on the file system o Modify ufs_mkdir() and ufs_makeinode() to associate ACLs with newly created directories and files, inheriting from the parent directory's default ACL. o Enable these new vnode operations and conditionally compiled code paths if UFS_ACL is defined. A few notes: o This implementation is fairly widely tested, but still should be considered experimental. o Currently, ACLs are not exported via NFS, instead, the summarizing file mode/etc from the inode is. This results in conservative protection behavior, similar to the behavior of ACL-nonaware programs acting locally. o It is possible that underlying binary data formats associated with this implementation may change. Consumers of the implementation should expect to find their local configuration obsoleted in the next few months, resulting in possible loss of ACL data during an upgrade. o The extended attributes interface and implementation is still undergoing modification to address portable interface concerns, as well as performance. o Many applications do not yet correctly handle ACLs. In general, due to the POSIX.1e ACL model, behavior of ACL-unaware applications will be conservative with respects to file protection; some caution is recommended. o Instructions for configuring and maintaining ACLs on UFS will be committed in the near future; in the mean time it is possible to reference the README included in the last UFS ACL distribution placed in the TrustedBSD web site: http://www.TrustedBSD.org/downloads/ Substantial debugging, hardware, travel, or connectivity support for this project was provided by: BSDi, Safeport Network Services, and NAI Labs. Significant coding contributions were made by Chris Faulhaber. Additional support was provided by Brian Feldman, Thomas Moestl, and Ilmar Habibulin. Reviewed by: jedgar, keichii, mckusick, trustedbsd-discuss, freebsd-fs Obtained from: TrustedBSD Project
74810	26-Mar-2001	phk	Send the remains (such as I have located) of "block major numbers" to the bit-bucket.
74747	24-Mar-2001	asmodai	Fix typo ); -> ,
74705	23-Mar-2001	mckusick	Check that background fsck operation is being done on a ufs filesystem. Obtained from: Robert Watson <rwatson@FreeBSD.org>
74608	21-Mar-2001	rwatson	o Remove an unnecessary debugging printf from ufs_extattr_lookup(), which resulted in the output of warning messages at boot if UFS_EXTATTR_AUTOSTART was enabled but ".attribute" and possible sub-directories weren't in a mounted MFS or UFS file systems. Pointed out by: dcs Obtained from: TrustedBSD Project
74548	21-Mar-2001	mckusick	Add kernel support for running fsck on active filesystems.
74547	21-Mar-2001	mckusick	Clear the fs_clean flag only when the FS_UNCLEAN flag is not set (as is done in unmount). Remove a snapshot inode from the superblock list when its last name goes away rather than when its last reference goes away. That way it will be properly reclaimed by fsck after a crash rather than reenabled when the filesystem is mounted.
74545	21-Mar-2001	mckusick	Report the correct inode number when panicing with freeing free inode. Report the correct block number when panicing with freeing free block.
74442	19-Mar-2001	rwatson	o Enable UFS-based extended attribute support on MFS. Note that this change is under-tested, and that MFS appears to be in the process of being deprecated in favor of FFS over md. Note also that UFS_EXTATTR_AUTOSTART doesn't make much sense on MFS unless the MFSROOT is compiled in, so manual configuration is generally required. Obtained from: TrustedBSD Project
74437	19-Mar-2001	rwatson	o Rename "namespace" argument to "attrnamespace" as namespace is a C++ reserved word. Submitted by: jkh Obtained from: TrustedBSD Project
74433	19-Mar-2001	rwatson	o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively. This change reflects the fact that our EA support is implemented entirely at the UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount events). This also better reflects the fact that [shortly] MFS will also support EAs, as well as possibly IFS. o Consumers of the EA support in FFS are reminded that as a result, they must change kernel config files to reflect the new option names. Obtained from: TrustedBSD Project
74404	18-Mar-2001	rwatson	o Caused FFS_EXTATTR_AUTOSTART to scan two sub-directories of ".attribute" off of the file system root: "user" for user attributes, and "system" for system attributes. When the scan occurs, attribute backing files discovered in those directories will be started in the respective namespaces. This re-introduces support for auto-starting of user attributes, which was removed when the "$" prefix for system attributes was replaced with explicit namespacing. For users of the TrustedBSD UFS POSIX.1e ACL code, you'll need to: mv ${FSROOT}/'$posix1e.acl_access' ${FSROOT}/system/posix1e.acl_access mv ${FSROOT}/'$posix1e.acl_default' ${FSROOT}/system/posix1e.acl_default For users of the TrustedBSD POSIX.1e Capability code, you'll need to: mv ${FSROOT}/'$posix1e.cap' ${FSROOT}/system/posix1e.cap For users of the TrustedBSD MAC code, you'll need to: mv ${FSROOT}/'$freebsd.mac' ${FSROOT}/system/freebsd.mac Updated versions of relevant patches will be released in the near future. Obtained from: TrustedBSD Project
74273	15-Mar-2001	rwatson	o Change the API and ABI of the Extended Attribute kernel interfaces to introduce a new argument, "namespace", rather than relying on a first- character namespace indicator. This is in line with more recent thinking on EA interfaces on various mailing lists, including the posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces are defined by default, EXTATTR_NAMESPACE_SYSTEM and EXTATTR_NAMESPACE_USER, where the primary distinction lies in the access control model: user EAs are accessible based on the normal MAC and DAC file/directory protections, and system attributes are limited to kernel-originated or appropriately privileged userland requests. o These API changes occur at several levels: the namespace argument is introduced in the extattr_{get,set}_file() system call interfaces, at the vnode operation level in the vop_{get,set}extattr() interfaces, and in the UFS extended attribute implementation. Changes are also introduced in the VFS extattrctl() interface (system call, VFS, and UFS implementation), where the arguments are modified to include a namespace field, as well as modified to advoid direct access to userspace variables from below the VFS layer (in the style of recent changes to mount by adrian@FreeBSD.org). This required some cleanup and bug fixing regarding VFS locks and the VFS interface, as a vnode pointer may now be optionally submitted to the VFS_EXTATTRCTL() call. Updated documentation for the VFS interface will be committed shortly. o In the near future, the auto-starting feature will be updated to search two sub-directories to the ".attribute" directory in appropriate file systems: "user" and "system" to locate attributes intended for those namespaces, as the single filename is no longer sufficient to indicate what namespace the attribute is intended for. Until this is committed, all attributes auto-started by UFS will be placed in the EXTATTR_NAMESPACE_SYSTEM namespace. o The default POSIX.1e attribute names for ACLs and Capabilities have been updated to no longer include the '$' in their filename. As such, if you're using these features, you'll need to rename the attribute backing files to the same names without '$' symbols in front. o Note that these changes will require changes in userland, which will be committed shortly. These include modifications to the extended attribute utilities, as well as to libutil for new namespace string conversion routines. Once the matching userland changes are committed, a buildworld is recommended to update all the necessary include files and verify that the kernel and userland environments are in sync. Note: If you do not use extended attributes (most people won't), upgrading is not imperative although since the system call API has changed, the new userland extended attribute code will no longer compile with old include files. o Couple of minor cleanups while I'm there: make more code compilation conditional on FFS_EXTATTR, which should recover a bit of space on kernels running without EA's, as well as update copyright dates. Obtained from: TrustedBSD Project
74256	14-Mar-2001	rwatson	o In my merge, missed the one-line patch to ufs_vnops.c that removed the static prototype for ufs_readdir(). Note that ufs_readdir() was actually already non-static, the prototype was incorrect. Submitted by: jedgar
74234	14-Mar-2001	rwatson	o Implement "options FFS_EXTATTR_AUTOSTART", which depends on "options FFS_EXTATTR". When extended attribute auto-starting is enabled, FFS will scan the .attribute directory off of the root of each file system, as it is mounted. If .attribute exists, EA support will be started for the file system. If there are files in the directory, FFS will attempt to start them as attribute backing files for attributes baring the same name. All attributes are started before access to the file system is permitted, so this permits race-free enabling of attributes. For attributes backing support for security features, such as ACLs, MAC, Capabilities, this is vital, as it prevents the file system attributes from getting out of sync as a result of file system operations between mount-time and the enabling of the extended attribute. The userland extattrctl tool will still function exactly as previously. Files must be placed directly in .attribute, which must be directly off of the file system root: symbolic links are not permitted. FFS_EXTATTR will continue to be able to function without FFS_EXTATTR_AUTOSTART for sites that do not want/require auto-starting. If you're using the UFS_ACL code available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART is recommended. o This support is implemented by adding an invocation of ufs_extattr_autostart() to ffs_mountfs(). In addition, several new supporting calls are introduced in ufs_extattr.c: ufs_extattr_autostart(): start EAs on the specified mount ufs_extattr_lookup(): given a directory and filename, return the vnode for the file. ufs_extattr_enable_with_open(): invoke ufs_extattr_enable() after doing the equililent of vn_open() on the passed file. ufs_extattr_iterate_directory(): iterate over a directory, invoking ufs_extattr_lookup() and ufs_extattr_enable_with_open() on each entry. o This feature is not widely tested, and therefore may contain bugs, caution is advised. Several changes are in the pipeline for this feature, including breaking out of EA namespaces into subdirectories of .attribute (this is waiting on the updated EA API), as well as a per-filesystem flag indicating whether or not EAs should be auto-started. This is required because administrators may not want .attribute auto-started on all file systems, especially if non-administrators have write access to the root of a file system. Obtained from: TrustedBSD Project
73942	07-Mar-2001	mckusick	Fixes to track snapshot copy-on-write checking in the specinfo structure rather than assuming that the device vnode would reside in the FFS filesystem (which is obviously a broken assumption with the device filesystem).
73929	07-Mar-2001	jhb	Grab the process lock while calling psignal and before calling psignal.
73928	07-Mar-2001	jhb	Protect SIGDELSET of p_siglist with the proc lock.
73287	01-Mar-2001	mckusick	Free lock before returning from process_worklist_item. Obtained from: Constantine Sapuntzakis <csapuntz@stanford.edu>
73286	01-Mar-2001	adrian	Reviewed by: jlemon An initial tidyup of the mount() syscall and VFS mount code. This code replaces the earlier work done by jlemon in an attempt to make linux_mount() work. * the guts of the mount work has been moved into vfs_mount(). * move `type', `path' and `flags' from being userland variables into being kernel variables in vfs_mount(). `data' remains a pointer into userspace. * Attempt to verify the `type' and `path' strings passed to vfs_mount() aren't too long. * rework mount() and linux_mount() to take the userland parameters (besides data, as mentioned) and pass kernel variables to vfs_mount(). (linux_mount() already did this, I've just tidied it up a little more.) * remove the copyin() stuff for `path'. `data' still requires copyin() since its a pointer into userland. * set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each filesystem. This variable is generally initialised with `path', and each filesystem can override it if they want to. * NOTE: f_mntonname is intiailised with "/" in the case of a root mount.
72956	23-Feb-2001	jlemon	Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean(). Use this to tell a filter attached to a vnode that the underlying vnode is no longer valid, by returning EV_EOF. PR: kern/25309, kern/25206
72953	23-Feb-2001	jlemon	Use correct list pointer when detaching knote from list.
72941	23-Feb-2001	mckusick	Free lock before calling panic so that subsequent attempt to write out buffers does not re-panic with `locking against myself'. This change should not affect normal operations of soft updates in any way.
72872	22-Feb-2001	mckusick	When cleaning up excess inode dependencies, check for being done. Reviewed by: Jan Koum <jkb@yahoo-inc.com>
72765	20-Feb-2001	mckusick	This patch corrects two problems with the rate limiting code that was introduced in revision 1.80. The problem manifested itself with a `locking against myself' panic and could also result in soft updates inconsistences associated with inodedeps. The two problems are: 1) One of the background operations could manipulate the bitmap while holding it locked with intent to create. This held lock results in a `locking against myself' panic, when the background processing that we have been coopted to do tries to lock the bitmap which we are already holding locked. To understand how to fix this problem, first, observe that we can do the background cleanups in inodedep_lookup only when allocating inodedeps (DEPALLOC is set in the call to inodedep_lookup). Second observe that calls to inodedep_lookup with DEPALLOC set can only happen from the following calls into the softdep code: softdep_setup_inomapdep softdep_setup_allocdirect softdep_setup_remove softdep_setup_freeblocks softdep_setup_directory_change softdep_setup_directory_add softdep_change_linkcnt Only the first two of these can come from ffs_alloc.c while holding a bitmap locked. Thus, inodedep_lookup must not go off to do request_cleanups when being called from these functions. This change adds a flag, NODELAY, that can be passed to inodedep_lookup to let it know that it should not do background processing in those cases. 2) The return value from request_cleanup when helping out with the cleanup was 0 instead of 1. This meant that despite the fact that we may have slept while doing the cleanups, the code did not recheck for the appearance of an inodedep (e.g., goto top in inodedep_lookup). This lead to the softdep inconsistency in which we ended up with two inodedep's for the same inode. Reviewed by: Peter Wemm <peter@yahoo-inc.com>, Matt Dillon <dillon@earth.backplane.com>
72645	18-Feb-2001	asmodai	Preceed/preceeding are not english words. Use precede and preceding.
72521	15-Feb-2001	jlemon	Extend kqueue down to the device layer. Backwards compatible approach suggested by: peter
72376	12-Feb-2001	jake	Implement a unified run queue and adjust priority levels accordingly. - All processes go into the same array of queues, with different scheduling classes using different portions of the array. This allows user processes to have their priorities propogated up into interrupt thread range if need be. - I chose 64 run queues as an arbitrary number that is greater than 32. We used to have 4 separate arrays of 32 queues each, so this may not be optimal. The new run queue code was written with this in mind; changing the number of run queues only requires changing constants in runq.h and adjusting the priority levels. - The new run queue code takes the run queue as a parameter. This is intended to be used to create per-cpu run queues. Implement wrappers for compatibility with the old interface which pass in the global run queue structure. - Group the priority level, user priority, native priority (before propogation) and the scheduling class into a struct priority. - Change any hard coded priority levels that I found to use symbolic constants (TTIPRI and TTOPRI). - Remove the curpriority global variable and use that of curproc. This was used to detect when a process' priority had lowered and it should yield. We now effectively yield on every interrupt. - Activate propogate_priority(). It should now have the desired effect without needing to also propogate the scheduling class. - Temporarily comment out the call to vm_page_zero_idle() in the idle loop. It interfered with propogate_priority() because the idle process needed to do a non-blocking acquire of Giant and then other processes would try to propogate their priority onto it. The idle process should not do anything except idle. vm_page_zero_idle() will return in the form of an idle priority kernel thread which is woken up at apprioriate times by the vm system. - Update struct kinfo_proc to the new priority interface. Deliberately change its size by adjusting the spare fields. It remained the same size, but the layout has changed, so userland processes that use it would parse the data incorrectly. The size constraint should really be changed to an arbitrary version number. Also add a debug.sizeof sysctl node for struct kinfo_proc.
72200	09-Feb-2001	bmilekic	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
72012	04-Feb-2001	phk	Another round of the <sys/queue.h> FOREACH transmogriffer. Created with: sed(1) Reviewed by: md5(1)
71999	04-Feb-2001	phk	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
71998	04-Feb-2001	phk	Use <sys/queue.h> macro API.
71993	04-Feb-2001	phk	Remove a DIAGNOSTIC check which belongs in <sys/queue.h> if anyplace at all.
71976	04-Feb-2001	iedowse	Extend the sanity checks in ufs_lookup to ensure that each directory entry fits within its DIRBLKSIZ block. The surrounding code is extremely fragile with respect to corruption of the directory entry 'd_reclen' field; if directory corruption occurs, it can blindly scan forward beyond the end of the filesystem block. Usually this results in a 'fault on nofault entry' panic. Directory corruption is now much more likely to be detected, resulting in a 'ufs_dirbad' panic. If the filesystem is read-only, it will simply print a warning message, and skip the corrupted block. Reviewed by: mckusick
71968	03-Feb-2001	iedowse	Use the correct flags field when checking for a read-only filesystem in ufs_dirbad(). The mnt_stat.f_flags field is only updated by the syscalls *statfs and getfsstat, so mnt_flag should be used instead. This only affects whether or not a panic is generated on detection of certain types of directory corruption. Reviewed by: mckusick
71820	30-Jan-2001	dillon	Fix a race between the syncer and umount. When you umount a softupdates filesystem softdep_process_worklist() is called in a loop until it indicates that no dependancies remain, but the determination of that fact depends on there only being one softdep_process_worklist() instance running. It was possible for the syncer to also be running softdep_process_worklist() and the pre-existing checks in the code to prevent this were not sufficient to prevent the race. This patch solves the problem. Approved-by: mckusick
71576	24-Jan-2001	jasone	Convert all simplelocks to mutexes and remove the simplelock implementations.
71073	15-Jan-2001	iedowse	The ffs superblock includes a 128-byte region for use by temporary in-core pointers to summary information. An array in this region (fs_csp) could overflow on filesystems with a very large number of cylinder groups (~16000 on i386 with 8k blocks). When this happens, other fields in the superblock get corrupted, and fsck refuses to check the filesystem. Solve this problem by replacing the fs_csp array in 'struct fs' with a single pointer, and add padding to keep the length of the 128-byte region fixed. Update the kernel and userland utilities to use just this single pointer. With this change, the kernel no longer makes use of the superblock fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c to indicate that these fields must be calculated for compatibility with older kernels. Reviewed by: mckusick
70980	12-Jan-2001	mckusick	Properly compute the size of the final block of superblock summary information. Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
70776	07-Jan-2001	rwatson	o Commit reems of style(9) changes, whitespace improvements, and comment cleanups. Obtained from: TrustedBSD Project
70774	07-Jan-2001	rwatson	o Zero the ufs_extattr_header length field (not necessary, but not a bad idea either) in ufs_extattr_rm. o More completely fill out the local_aio structure when writing out the zero'd extended attribute in ufs_extattr_rm -- previoulsy, this worked fine, but probably should not have. This corrects extraneous warnings about inconsistent inodes following file deletion. Reviewed by: jedgar
70773	07-Jan-2001	rwatson	o Add an additional EA inconsistency reporting opportunity in ufs_extattr_rm. o Make both reporting locations report the function name where the inconsistency is discovered, as well as the inode number in question. Reviewed by: jedgar
70767	07-Jan-2001	rwatson	o Make call to ufs_extattr_rm() in ufs_extattr_vnode_inactive() use NULL as the credential, not 0, so as to make it more clear what's going on. Obtained from: TrustedBSD Project
70764	07-Jan-2001	rwatson	o Remove unnecessary sanity check involving requested offset of extended attribute read--the offset is required to be 0 by an earlier check, meaning that it will always be within the scope of the attribute data. This change should have no impact on executed code paths other than removing the unnecessary check: please report if any new failures start to occur as a result. Obtained from: TrustedBSD Project
70374	26-Dec-2000	dillon	This implements a better launder limiting solution. There was a solution in 4.2-REL which I ripped out in -stable and -current when implementing the low-memory handling solution. However, maxlaunder turns out to be the saving grace in certain very heavily loaded systems (e.g. newsreader box). The new algorithm limits the number of pages laundered in the first pageout daemon pass. If that is not sufficient then suceessive will be run without any limit. Write I/O is now pipelined using two sysctls, vfs.lorunningspace and vfs.hirunningspace. This prevents excessive buffered writes in the disk queues which cause long (multi-second) delays for reads. It leads to more stable (less jerky) and generally faster I/O streaming to disk by allowing required read ops (e.g. for indirect blocks and such) to occur without interrupting the write stream, amoung other things. NOTE: eventually, filesystem write I/O pipelining needs to be done on a per-device basis. At the moment it is globalized.
70183	19-Dec-2000	mckusick	Several small but important fixes for snapshots: 1) Be more tolerant of missing snapshot files by only trying to decrement their reference count if they are registered as active. 2) Fix for snapshots of filesystems with block sizes larger than 8K (from Ollivier Robert <roberto@eurocontrol.fr>). 3) Fix to avoid losing last block in snapshot file when calculating blocks that need to be copied (from Don Coleman <coleman@coleman.org>).
70182	19-Dec-2000	mckusick	Get rid of spurious check in ffs_truncate for i_size == length which fails to set the modification time on the file. The same check a few lines later takes the correct action. Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
70132	17-Dec-2000	assar	add a stub for softdep_slowdown so that it's possible to build the kernel without SOFTUPDATES
70131	17-Dec-2000	dillon	Avoid a data-consistency race between write() and mmap() by ensuring that newly allocated blocks are zerod. The race can occur even in the case where the write covers the entire block. Reported by: Sven Berkvens <sven@berkvens.net>, Marc Olzheim <zlo@zlo.nu>
70011	14-Dec-2000	tanimura	- Move ifs_init() so that it can initialize ifs_inode_hash_mtx. - s/ffs_inode_hash_lock/ifs_inode_hash_lock/
69974	13-Dec-2000	tanimura	Do not race for the lock of an inode hash. Reviewed by: jhb
69967	13-Dec-2000	mckusick	Preventing runaway kernel soft updates memory, take three. Previously, the syncer process was the only process in the system that could process the soft updates background work list. If enough other processes were adding requests to that list, it would eventually grow without bound. Because some of the work list requests require vnodes to be locked, it was not generally safe to let random processes process the work list while they already held vnodes locked. By adding a flag to the work list queue processing function to indicate whether the calling process could safely lock vnodes, it becomes possible to co-opt other processes into helping out with the work list. Now when the worklist gets too large, other processes can safely help out by picking off those work requests that can be handled without locking a vnode, leaving only the small number of requests requiring a vnode lock for the syncer process. With this change, it appears possible to keep even the nastiest workloads under control. Submitted by: Paul Saab <ps@yahoo-inc.com>
69781	08-Dec-2000	dwmalone	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
69774	08-Dec-2000	phk	Staticize some malloc M_ instances.
69686	06-Dec-2000	dillon	Add necessary bwillwrite() in writev() entry point. Deal with excessive dirty buffers when msync() syncs non-contiguous dirty buffers by checking for the case in UFS before checking for clusterability.
68933	20-Nov-2000	mckusick	More aggressively rate limit the growth of soft dependency structures in the face of multiple processes doing massive numbers of filesystem operations. While this patch will work in nearly all situations, there are still some perverse workloads that can overwhelm the system. Detecting and handling these perverse workloads will be the subject of another patch. Reviewed by: Paul Saab <ps@yahoo-inc.com> Obtained from: Ethan Solomita <ethan@geocast.com>
68885	18-Nov-2000	dillon	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather then block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmatic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather then continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
68715	14-Nov-2000	mckusick	When deleting a file, the ordering of events imposed by soft updates is to first write the deleted directory entry to disk, second write the zero'ed inode to disk, and finally to release the freed blocks and the inode back to the cylinder-group map. As this ordering requires two disk writes to occur which are normally spaced about 30 seconds apart (except when memory is under duress), it takes about a minute from the time that a file is deleted until its inode and data blocks show up in the cylinder-group map for reallocation. If a file has had only a brief lifetime (less than 30 seconds from creation to deletion), neither its inode nor its directory entry may have been written to disk. If its directory entry has not been written to disk, then we need not wait for that directory block to be written as the on-disk directory block does not reference the inode. Similarly, if the allocated inode has never been written to disk, we do not have to wait for it to be written back either as its on-disk representation is still zero'ed out. Thus, in the case of a short lived file, we can simply release the blocks and inode to the cylinder-group map immediately. As the inode and its blocks are released immediately, they are immediately available for other uses. If they are not released for a minute, then other inodes and blocks must be allocated for short lived files, cluttering up the vnode and buffer caches. The previous code was a bit too aggressive in trying to release the blocks and inode back to the cylinder-group map resulting in their being made available when in fact the inode on disk had not yet been zero'ed. This patch takes a more conservative approach to doing the release which avoids doing the release prematurely.
68307	04-Nov-2000	bde	Fixed breakage of mknod() in rev.1.48 of ext2_vnops.c and rev.1.126 of ufs_vnops.c: 1) i_ino was confused with i_number, so the inode number passed to VFS_VGET() was usually wrong (usually 0U). 2) ip was dereferenced after vgone() freed it, so the inode number passed to VFS_VGET() was sometimes not even wrong. Bug (1) was usually fatal in ext2_mknod(), since ext2fs doesn't have space for inode 0 on the disk; ino_to_fsba() subtracts 1 from the inode number, so inode number 0U gives a way out of bounds array index. Bug(1) was usually harmless in ufs_mknod(); ino_to_fsba() doesn't subtract 1, and VFS_VGET() reads suitable garbage (all 0's?) from the disk for the invalid inode number 0U; ufs_mknod() returns a wrong vnode, but most callers just vput() it; the correct vnode is eventually obtained by an implicit VFS_VGET() just like it used to be. Bug (2) usually doesn't happen.
68186	01-Nov-2000	eivind	Give vop_mmap an untimely death. The opportunity to give it a timely death timed out in 1996.
68003	30-Oct-2000	phk	Add a missing <sys/systm.h>
67893	29-Oct-2000	phk	Move suser() and suser_xxx() prototypes and a related #define from <sys/proc.h> to <sys/systm.h>. Correctly document the #includes needed in the manpage. Add one now needed #include of <sys/systm.h>. Remove the consequent 48 unused #includes of <sys/proc.h>.
67885	29-Oct-2000	phk	Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ing the offending inline function (BUF_KERNPROC) on it being #included already. I'm not sure BUF_KERNPROC() is even the right thing to do or in the right place or implemented the right way (inline vs normal function). Remove consequently unneeded #includes of <sys/proc.h>
67882	29-Oct-2000	phk	Remove unneeded #include <sys/proc.h> lines.
67309	19-Oct-2000	rwatson	o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform "administrative" authorization checks. In most cases, the VADMIN test checks to make sure the credential effective uid is the same as the file owner. o Modify vaccess() to set VADMIN as an available right if the uid is appropriate. o Modify references to uid-based access control operations such that they now always invoke VOP_ACCESS() instead of using hard-coded policy checks. o This allows alternative UFS policies to be implemented by replacing only ufs_access() (such as mandatory system policies). o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the vnode: I believe that new invocations of VOP_ACCESS() are always called with the lock held. o Some direct checks of the uid remain, largely associated with the QUOTA and SUIDDIR code. Reviewed by: eivind Obtained from: TrustedBSD Project
67106	14-Oct-2000	adrian	Initial commit of IFS - a inode-namespaced FFS. Here is a short description: How it works: -- Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.) I didn't see the need in duplicating all of sys/ufs/ffs to get this off the ground. File creation is done through a special file - 'newfile' . When newfile is called, the system allocates and returns an inode. Note that newfile is done in a cloning fashion: fd = open("newfile", O_CREAT\|O_RDWR, 0644); fstat(fd, &st); printf("new file is %d\n", (int)st.st_ino); Once you have created a file, you can open() and unlink() it by its returned inode number retrieved from the stat call, ie: fd = open("5", O_RDWR); The creation permissions depend entirely if you have write access to the root directory of the filesystem. To get the list of currently allocated inodes, VOP_READDIR has been added which returns a directory listing of those currently allocated. -- What this entails: * patching conf/files and conf/options to include IFS as a new compile option (and since ifs depends upon FFS, include the FFS routines) * An entry in i386/conf/NOTES indicating IFS exists and where to go for an explanation * Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS routines require (ffs_mount() and ffs_reload()) * a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS routines. IFS replaces some of the vfsops, and a handful of vnops - most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR(). Any other directory operation is marked as invalid. What this results in: * an IFS partition's create permissions are controlled by the perm/ownership of the root mount point, just like a normal directory * Each inode has perm and ownership too * IFS does NOT mean an FFS partition can be opened per inode. This is a completely seperate filesystem here * Softupdates doesn't work with IFS, and really I don't think it needs it. Besides, fsck's are FAST. (Try it :-) * Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC). Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against this particular inode, and unravelling THAT code isn't trivial. Therefore, useful inodes start at 3. Enjoy, and feedback is definitely appreciated!
66893	09-Oct-2000	rwatson	o Sanity check was inverted, resulting in a possible spurious panic during unmount if extended attributes were in use. Correct by removing an unneeded (and undesirable) '!'.
66886	09-Oct-2000	eivind	Blow away the v_specmountpoint define, replacing it with what it was defined as (rdev->si_mountpoint)
66753	06-Oct-2000	rwatson	o Move initialization of ump from mp to the top of the function so that it is defined whenm used in ufs_extattr_uepm_destroy(), fixing a panic due to a NULL pointer dereference. Submitted by: Wesley Morgan <morganw@chemicals.tacorp.com>
66617	04-Oct-2000	rwatson	o Add call to ufs_extattr_uepm_destroy() in ffs_unmount() so as to clean up lock on extattrs. o Get for free a comment indicating where auto-starting of extended attributes will eventually occur, as it was in my commit tree also. No implementation change here, only a comment.
66616	04-Oct-2000	rwatson	o Correct use of lockdestroy() by adding a new ufs_extattr_uepm_destroy() call, which should be the last thing down to a per-mount extattr management structure, after ufs_extattr_stop() on the file system. This currently has the effect only of destroying the per-mount lock on extended attributes, and clearing appropriate flags. o Remove inappropriate invocation in ufs_extattr_vnode_inactive().
66615	04-Oct-2000	jasone	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
66355	25-Sep-2000	bp	Add a lock structure to vnode structure. Previously it was either allocated separately (nfs, cd9660 etc) or keept as a first element of structure referenced by v_data pointer(ffs). Such organization leads to known problems with stacked filesystems. From this point vop_nolock() functions maintain only interlock lock. vop_stdlock() functions maintain built-in v_lock structure using lockmgr(). vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared lock on vnode. If filesystem wishes to export lockmgr compatible lock, it can put an address of this lock to v_vnlock field. This indicates that the upper filesystem can take advantage of it and use single lock structure for entire (or part) of stack of vnodes. This field shouldn't be examined or modified by VFS code except for initialization purposes. Reviewed in general by: mckusick
66187	21-Sep-2000	rwatson	o Permit UFS Extended Attributes to be associated with special devices and FIFOs. Obtained from: TrustedBSD Project
66041	18-Sep-2000	rwatson	o Disallow privileged processes in jail() from directly accessing system namespace extended attributes. o Document privilege/jail() interaction relating to extended attributes. Obtained from: TrustedBSD Project
66040	18-Sep-2000	rwatson	o Allow privileged processes in jail() to override sticky bit behavior on directories. o Allow privileged processes in jail() to create inodes with the setgid bit set even if they are not a member of the group denoted by the file creation gid. This occurs due to inherited gid's from parent directories on file creation, allowing a user to create a file with a gid that is not in the creating process's credentials. Obtained from: TrustedBSD Project
66039	18-Sep-2000	rwatson	o Add a comment clarifying interaction between jail(), privileged processes, and UFS file flags. Here's what the comment says, for reference: Privileged processes in jail() are permitted to modify arbitrary user flags on files, but are not permitted to modify system flags. In other words, privilege does allow a process in jail to modify user flags for objects that the process does not own, but privilege will not permit the setting of system flags on the file. Obtained from: TrustedBSD Project
66038	18-Sep-2000	rwatson	o Add missing PRISON_ROOT allowing a privileged process in a jail() to not remove the setuid/setgid bits by virtue of a change to a file with those bits set, even if the process doesn't own the file, or isn't a group member of the file's gid. Obtained from: TrustedBSD Project
66033	18-Sep-2000	rwatson	o Substitute suser() calls for direct credential checks, which is now safe as suser() no longer sets ASU. o Note that in some cases, the PRISON_ROOT flag is used even though no process structure is passed, to indicate that if a process structure (and hence jail) was available, it would be ok. In the long run, the jail identifier should probably be moved to ucred, as the uidinfo information was. o Some uid 0 checks remain relating to the quota code, which I'll leave for another day. Reviewed by: phk, eivind Obtained from: TrustedBSD Project
65998	17-Sep-2000	des	Silence a warning.
65973	17-Sep-2000	bp	Add new flag PDIRUNLOCK to the component.cn_flags which should be set by filesystem lookup() routine if it unlocks parent directory. This flag should be carefully tracked by filesystems if they want to work properly with nullfs and other stacked filesystems. VFS takes advantage of this flag to perform symantically correct usage of vrele() instead of vput() if parent directory already unlocked. If filesystem fails to track this flag then previous codepath in VFS left unchanged. Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystmes will be changed after some period of testing. Reviewed in general by: mckusick, dillon, adrian Obtained from: NetBSD
65928	16-Sep-2000	phk	Remove a pointless casting of a gid_t to a gid_t.
65779	12-Sep-2000	bp	Add VOP_*VOBJECT vops, because MFS requires explicit vop specification. Noted by: knu
65768	12-Sep-2000	rwatson	o Variety of extended attribute fixes - In ufs_extattr_enable(), return EEXIST instead of EOPNOTSUPP if the caller tries to configure an attribute name that is already configured - Throughout, add IO_NODELOCKED to VOP_{READ,WRITE} calls to indicate lock status of passed vnode. Apparently not a problem, but worth fixing. - For all writes, make use of IO_SYNC consistent. Really, IO_UNIT and combining of VOP_WRITE's should happen, but I don't have that tested. At least with this, it's consistent usage. (pointed out by: bde) - In ufs_extattr_get(), fixed nested locking of backing vnode (fine due to recursive lock support, but make it more consistent with other code) - In ufs_extattr_get(), clean up return code to set uio_resid more consistently with other pieces of code (worked fine, this is just a cleanup) - Fix ufs_extattr_rm(), which was broken--effectively a nop. - Minor comment and whitespace fixes. Obtained from: TrustedBSD Project
65721	11-Sep-2000	jhb	Fix a 64-bitism. Use size_t instead of int for 4th argument to copyinstr. Approved by: rwatson
65595	07-Sep-2000	mckusick	Cannot do MALLOC with M_WAITOK while holding ACQUIRE_LOCK Obtained from: Ethan Solomita <ethan@geocast.com>
65557	07-Sep-2000	jasone	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
65377	02-Sep-2000	rwatson	Modify extended attribute protection model to authorize based on attribute namespace and DAC protection on file: - Attribute names beginning with '$' are in the system namespace - The attribute name "$" is reserved - System namespace attributes may only be read/set by suser() or by kernel (cred == NULL) - Other attribute names are in the application namespace - The attribute name "" is reserved - Application namespace attributes are protected in the manner of the target file permission o Kernel changes - Add ufs_extattr_valid_attrname() to check whether the requested attribute "set" or "enable" is appropriate (i.e., non-reserved) - Modify ufs_extattr_credcheck() to accept target file vnode, not to take inode uid - Modify ufs_extattr_credcheck() to check namespace, then enforce either kernel/suser for system namespace, or vaccess() for application namespace o EA backing file format changes - Remove permission fields from extended attribute backing file header - Bump extended attribute backing file header version to 3 o Update extattrctl.c and extattrctl.8 - Remove now deprecated -r and -w arguments to initattr, as permissions are now implicit - (unrelated) fix error reporting and unlinking during failed initattr to remove duplicate/inaccurate error messages, and to only unlink if the failure wasn't in the backing file open() Obtained from: TrustedBSD Project
65200	29-Aug-2000	rwatson	o Restructure vaccess() so as to check for DAC permission to modify the object before falling back on privilege. Make vaccess() accept an additional optional argument, privused, to determine whether privilege was required for vaccess() to return 0. Add commented out capability checks for reference. Rename some variables to make it more clear which modes/uids/etc are associated with the object, and which with the access mode. o Update file system use of vaccess() to pass NULL as the optional privused argument. Once additional patches are applied, suser() will no longer set ASU, so privused will permit passing of privilege information up the stack to the caller. Reviewed by: bde, green, phk, -security, others Obtained from: TrustedBSD Project
65119	26-Aug-2000	rwatson	o Correct spelling of ufs_exttatr_find_attr -> ufs_extattr_find_attr o Add "const" qualifier to attrname argument of various calls to remove warnings Obtained from: TrustedBSD Project
64880	20-Aug-2000	phk	Remove all traces of Julians DEVFS (incl from kern/subr_diskslice.c) Remove old DEVFS support fields from dev_t. Make uid, gid & mode members of dev_t and set them in make_dev(). Use correct uid, gid & mode in make_dev in disk minilayer. Add support for registering alias names for a dev_t using the new function make_dev_alias(). These will show up as symlinks in DEVFS. Use makedev() rather than make_dev() for MFSs magic devices to prevent DEVFS from noticing this abuse. Add a field for DEVFS inode number in dev_t. Add new DEVFS in fs/devfs. Add devfs cloning to: disk minilayer (ie: ad(4), sd(4), cd(4) etc etc) md(4), tun(4), bpf(4), fd(4) If DEVFS add -d flag to /sbin/inits args to make it mount devfs. Add commented out DEVFS to GENERIC
64865	20-Aug-2000	phk	Centralize the canonical vop_access user/group/other check in vaccess(). Discussed with: bde
64437	09-Aug-2000	tegge	Initialize *countp to 0 in stub for softdep_flushworklist(). This allows ffs_fsync() to break out of a loop that might otherwise be infinite on kernels compiled without the SOFTUPDATES option. The observed symptom was a system hang at the first unmount attempt.
64104	01-Aug-2000	roberto	Fix the lockmgr panic everyone is seeing at shutdown time. vput assumes curproc is the lock holder, but it's not true in this case. Thanks a lot Luoqi ! Submitted by: luoqi Tested by: phk
63976	28-Jul-2000	peter	Minor tweak - removed unused variable 'struct mount *mp';
63975	28-Jul-2000	peter	Minor change: fix warning - move a 'struct vnode *vp' declaration inside a #ifdef DIAGNOSTIC to match its corresponding usage.
63897	26-Jul-2000	mckusick	Clean up the snapshot code so that it no longer depends on the use of the SF_IMMUTABLE flag to prevent writing. Instead put in explicit checking for the SF_SNAPSHOT flag in the appropriate places. With this change, it is now possible to rename and link to snapshot files. It is also possible to set or clear any of the owner, group, or other read bits on the file, though none of the write or execute bits can be set. There is also an explicit test to prevent the setting or clearing of the SF_SNAPSHOT flag via chflags() or fchflags(). Note also that the modify time cannot be changed as it needs to accurately reflect the time that the snapshot was taken. Submitted by: Robert Watson <rwatson@FreeBSD.org>
63889	26-Jul-2000	phk	Fix the "mfs_badop[vop_getwritemount] = 45" messages.
63829	25-Jul-2000	mckusick	Add stub for softdep_flushworklist() so that kernels compiled without the SOFTUPDATES option will load correctly. Obtained from: John Baldwin <jhb@bsdi.com>
63828	25-Jul-2000	mckusick	Eliminate periodic 'mfs_badop[vop_getwritemount] = 45' messages. Submitted by: Sheldon Hearn <sheldonh@uunet.co.za>
63788	24-Jul-2000	mckusick	This patch corrects the first round of panics and hangs reported with the new snapshot code. Update addaliasu to correctly implement the semantics of the old checkalias function. When a device vnode first comes into existence, check to see if an anonymous vnode for the same device was created at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than creating a new vnode for the device. This corrects a problem which caused the kernel to panic when taking a snapshot of the root filesystem. Change the calling convention of vn_write_suspend_wait() to be the same as vn_start_write(). Split out softdep_flushworklist() from softdep_flushfiles() so that it can be used to clear the work queue when suspending filesystem operations. Access to buffers becomes recursive so that snapshots can recursively traverse their indirect blocks using ffs_copyonwrite() when checking for the need for copy on write when flushing one of their own indirect blocks. This eliminates a deadlock between the syncer daemon and a process taking a snapshot. Ensure that softdep_process_worklist() can never block because of a snapshot being taken. This eliminates a problem with buffer starvation. Cleanup change in ffs_sync() which did not synchronously wait when MNT_WAIT was specified. The result was an unclean filesystem panic when doing forcible unmount with heavy filesystem I/O in progress. Return a zero'ed block when reading a block that was not in use at the time that a snapshot was taken. Normally, these blocks should never be read. However, the readahead code will occationally read them which can cause unexpected behavior. Clean up the debugging code that ensures that no blocks be written on a filesystem while it is suspended. Snapshots must explicitly label the blocks that they are writing during the suspension so that they do not cause a `write on suspended filesystem' panic. Reorganize ffs_copyonwrite() to eliminate a deadlock and also to prevent a race condition that would permit the same block to be copied twice. This change eliminates an unexpected soft updates inconsistency in fsck caused by the double allocation. Use bqrelse rather than brelse for buffers that will be needed soon again by the snapshot code. This improves snapshot performance.
63099	14-Jul-2000	rwatson	o Marius pointed out an unusually inconvenient upper bound on extended attribute data size. o Fortunately it turned out to be an unused constant left over from an earlier implementation, and is therefore being removed so as not to confuse casual observers. Submitted by: mbendiks@eunet.no
63059	13-Jul-2000	bp	Prevent possible dereference of NULL pointer. Submitted by: Marius Bendiksen <mbendiks@eunet.no>
62985	12-Jul-2000	mckusick	Brain fault, forgot to update ffs_snapshot.c with the new calling convention for vn_start_write.
62976	11-Jul-2000	mckusick	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
62968	11-Jul-2000	mckusick	Clean up warning about undeclared function by declaring softdep_fsync in mount.h instead of ffs_extern.h. The correct solution is to use an indirect function pointer so that the kernel does not have to be built with options FFS, but that will be left for another day.
62907	10-Jul-2000	phk	Finish repo-copy: Move ufs/ufs/ufs_disksubr.c to kern/subr_disklabel.c. These functions are not UFS specific and are in fact used all over the place.
62799	08-Jul-2000	mckusick	Delete README as it is now obsolete. Relevant information is in README.softupdates.
62798	08-Jul-2000	mckusick	Update to reflect current status.
62553	04-Jul-2000	mckusick	Get userland visible flags added for snapshots to give a few days advance preparation for them to get migrated into place so that subsequent changes in utilities will not fail to compile for lack of up-to-date header files in /usr/include.
62550	04-Jul-2000	mckusick	Move the truncation code out of vn_open and into the open system call after the acquisition of any advisory locks. This fix corrects a case in which a process tries to open a file with a non-blocking exclusive lock. Even if it fails to get the lock it would still truncate the file even though its open failed. With this change, the truncation is done only after the lock is successfully acquired. Obtained from: BSD/OS
62469	03-Jul-2000	phk	Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES, that is way cleaner than using the softupdates_stub stunt, which should be killed when convenient. Discussed with: mckusick
62148	27-Jun-2000	phk	Move prtactive to vfs from ufs. It is used all over the place.
62033	24-Jun-2000	ache	Remove obsoleted info about linking from contrib
61926	22-Jun-2000	mckusick	Update to new copyright.
61813	18-Jun-2000	mckusick	When running with quotas enabled on a filesystem using soft updates, the system would panic when a user's inode quota was exceeded (see PR 18959 for details). This fixes that problem. PR: 18959 Submitted by: Jason Godsey <jason@unixguy.fidalgo.net>
61812	18-Jun-2000	mckusick	Some additional performance improvements. When freeing an inode check to see if it has been committed to disk. If it has never been written, it can be freed immediately. For short lived files this change allows the same inode to be reused repeatedly. Similarly, when upgrading a fragment to a larger size, if it has never been claimed by an inode on disk, it too can be freed immediately making it available for reuse often in the next slowly growing block of the same file.
61730	16-Jun-2000	phk	Revert part of my bioops change which implemented panic(8).
61729	16-Jun-2000	phk	ARGH! I have too many source trees :-( Fix prototype errors in last commit.
61724	16-Jun-2000	phk	Virtualizes & untangles the bioops operations vector. Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
61698	14-Jun-2000	phk	Remove a comment which should never have made it in.
61281	05-Jun-2000	rwatson	o Remove unneeded off_t variable to clean up compile warning Obtained from: TrustedBSD Project
61237	04-Jun-2000	rwatson	o If FFS_EXTATTR is defined, don't print out an error message on unmount if an FFS partition returns EOPNOTSUPP, as it just means extended attributes weren't enabled on that partition. Prevents spurious warning per-partition at shutdown.
60938	26-May-2000	jake	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
60833	23-May-2000	jake	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
60165	07-May-2000	rwatson	s/ffs_unmonut/ffs_unmount/ in a gratuitous ufs_extattr printf. Reported by: knu
60041	05-May-2000	phk	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bdes teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter
59913	03-May-2000	rwatson	Don't allow VOP_GETEXTATTR to set uio->uio_offset != 0, as we don't provide locking over extended attribute operations, requiring that individual operations be atomic. Allowing non-zero starting offsets permits applications/etc to put themselves at risk for inconsistent behavior. As VOP_SETEXTATTR already prohibited non-zero write offsets, this makes sense. Suggested by: Andreas Gruenbacher <a.gruenbacher@bestbits.at>
59794	30-Apr-2000	phk	Remove unneeded #include <vm/vm_zone.h> Generated by: src/tools/tools/kerninclude
59762	29-Apr-2000	phk	s/biowait/bufwait/g Prodded by: several.
59760	29-Apr-2000	phk	Remove unneeded #include <sys/kernel.h>
59721	28-Apr-2000	mckusick	When files are given to users by root, the quota system failed to reset their grace timer as their ownership crossed the soft limit threshhold. Thus if they had been over their limit in the past, they were suddenly penalized as if they had been over their limit ever since. The fix is to check when root gives away files, that when the receiving user crosses their soft limit, their grace timer is reset. See the PR report for a detailed method of reproducing the bug. PR: kern/17128 Submitted by: Andre Albsmeier <andre.albsmeier@mchp.siemens.de> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
59483	22-Apr-2000	phk	Convert the magic MFS device to a VCHR. Detected by: obrien
59400	19-Apr-2000	rwatson	o Introduce an extended attribute backing file header magic number o Introduce an extended attribute backing file header version number
59391	19-Apr-2000	phk	Remove ~25 unneeded #include <sys/conf.h> Remove ~60 unneeded #include <sys/malloc.h>
59388	19-Apr-2000	rwatson	o Cause attribute data writes to use IO_SYNC since this improves the chances of consistency with other file/directory meta-data in a write. In the current set of extended attribute applications, this does not hurt much. This should be discussed again later when it comes time to optimize performance of attributes. o Include an inode generation number in the per-attribute header information. This allows consistency verification to catch when a crash occurs, or an inode is recycled while attributes are not properly configured. For now, an irritating error message is displayed when an inconsistency occurs. At some point, may introduce an ``extattrctl check ...'' which catches these before attributes are enabled. Not today. If you get this message, it means you somehow managed to get your attribute backing file out of synch with the file system. When this occurs, attribute not found is returned (== undefined). Writes will overwrite the value there correcting the problem. Might want to think about introducing a new errno or two to handle this kind of situation. Discussed with: kris
59363	18-Apr-2000	phk	Retire bufqdisksort(), all drivers use bioqdisksort now.
59308	17-Apr-2000	jlemon	Remove unneeded cast.
59289	16-Apr-2000	jlemon	Replace the POLLEXTEND extensions with the kqueue() mechanism.
59268	16-Apr-2000	rwatson	Fix two bugs in extended attribute support for UFS/FFS: o Put back in {} removed during over-zealous cleanup of gratuitous debugging output during preparation for the commit. Due to the missing {}, writes on extended attributes always silently failed. Doh. o Don't unlock the target vnode if it's the backing vnode, as we don't lock the target vnode if it's the backing vnode.
59249	15-Apr-2000	phk	Complete the bio/buf divorce for all code below devfs::strategy Exceptions: Vinum untouched. This means that it cannot be compiled. Greg Lehey is on the case. CCD not converted yet, casts to struct buf (still safe) atapi-cd casts to struct buf to examine B_PHYS
59241	15-Apr-2000	rwatson	Introduce extended attribute support for FFS, allowing arbitrary (name, value) pairs to be associated with inodes. This support is used for ACLs, MAC labels, and Capabilities in the TrustedBSD security extensions, which are currently under development. In this implementation, attributes are backed to data vnodes in the style of the quota support in FFS. Support for FFS extended attributes may be enabled using the FFS_EXTATTR kernel option (disabled by default). Userland utilities and man pages will be committed in the next batch. VFS interfaces and man pages have been in the repo since 4.0-RELEASE and are unchanged. o ufs/ufs/extattr.h: UFS-specific extattr defines o ufs/ufs/ufs_extattr.c: bulk of support routines o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes o contrib/softupdates/ffs_softdep.c: extattr.h includes o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h (This should not be the case, and will be fixed in a future commit) Currently attributes are not supported in MFS. This will be fixed. Reviewed by: adrian, bp, freebsd-fs, other unthanked souls Obtained from: TrustedBSD Project
58942	02-Apr-2000	phk	Clone bio versions of certain bits of infrastructure: devstat_end_transaction_bio() bioq_* versions of bufq_* incl bioqdisksort() the corresponding "buf" versions will disappear when no longer used. Move b_offset, b_data and b_bcount to struct bio. Add BIO_FORMAT as a hack for fd.c etc. We are now largely ready to start converting drivers to use struct bio instead of struct buf.
58934	02-Apr-2000	phk	Move B_ERROR flag to b_ioflags and call it BIO_ERROR. (Much of this done by script) Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED. Move b_pblkno and b_iodone_chain to struct bio while we transition, they will be obsoleted once bio structs chain/stack. Add bio_queue field for struct bio aware disksort. Address a lot of stylistic issues brought up by bde.
58909	02-Apr-2000	dillon	Change the write-behind code to take more care when starting async I/O's. The sequential read heuristic has been extended to cover writes as well. We continue to call cluster_write() normally, thus blocks in the file will still be reallocated for large (but still random) I/O's, but I/O will only be initiated for truely sequential writes. This solves a number of annoying situations, especially with DBM (hash method) writes, and also has the side effect of fixing a number of (stupid) benchmarks. Reviewed-by: mckusick
58365	20-Mar-2000	phk	diff, patch and cvs didn't like these three last time around, try again.
58349	20-Mar-2000	phk	Rename the existing BUF_STRATEGY() to DEV_STRATEGY() substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo) substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo) This patch is machine generated except for the ccd.c and buf.h parts.
58345	20-Mar-2000	phk	Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new field in struct buf: b_iocmd. The b_iocmd is enforced to have exactly one bit set. B_WRITE was bogusly defined as zero giving rise to obvious coding mistakes. Also eliminate the redundant struct buf flag B_CALL, it can just as efficiently be done by comparing b_iodone to NULL. Should you get a panic or drop into the debugger, complaining about "b_iocmd", don't continue. It is likely to write on your disk where it should have been reading. This change is a step in the direction towards a stackable BIO capability. A lot of this patch were machine generated (Thanks to style(9) compliance!) Vinum users: Greg has not had time to test this yet, be careful.
58155	17-Mar-2000	mckusick	Use 64-bit math to calculate if we have hit our freespace limit. Necessary for coherent results on filesystems bigger than 0.5Tb.
58088	15-Mar-2000	mckusick	Bug fixes for currently harmless bugs that could rise to bite the unwary if the code were called in slightly different ways. 1) In ufs_bmaparray() the code for calculating 'runb' will stop one block short of the first entry in an indirect block. i.e. if an indirect block contains N block numbers b[0]..b[N-1] then the code will never check if b[0] and b[1] are sequential. For reference, compare with the equivalent code that deals with direct blocks. 2) In ufs_lookup() there is an off-by-one error in the test that checks if dp->i_diroff is outside the range of the the current directory size. This is completely harmless, since the following while-loop condition 'dp->i_offset < endsearch' is never met, so the code immediately does a second pass starting at dp->i_offset = 0. 3) Again in ufs_lookup(), the condition in a sanity check is wrong for directories that are longer than one block. This bug means that the sanity check is only effective for small directories. Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
58087	15-Mar-2000	mckusick	Use 64-bit math to decide if optimization needs to be changed. Necessary for coherent results on filesystems bigger than 0.5Tb. Submitted by: Paul Saab <ps@yahoo-inc.com>
57869	09-Mar-2000	dillon	In the 'found' case for ufs_lookup() the underlying bp's data was being accessed after the bp had been releaed. A simple move of the brelse() solves the problem. Approved by: jkh Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
57446	24-Feb-2000	dillon	Fix a 'freeing free block' panic in UFS. The problem occurs when the filesystem fills up. If the first indirect block exists and FFS is able to allocate deeper indirect blocks, but is not able to allocate the data block, FFS improperly unwinds the indirect blocks and leaves a block pointer hanging to a freed block. This will cause a panic later when the file is removed. The solution is to properly account for the first block-pointer-to-an-indirect-block we had to create in a balloc operation and then unwind it if a failure occurs. Detective work by: Ian Dowse <iedowse@maths.tcd.ie> Reviewed by: mckusick, Ian Dowse <iedowse@maths.tcd.ie> Approved by: jkh
57387	22-Feb-2000	rwatson	After much consulting with bde, concluded that this fix was the best fix to the current jail/chflags interactions. This fix conditionalizes ``root behavior'' in the chflags() case on not being in jail, so attempts to perform a chflags in a jail are limited to what a normal user could do. For example, this does allow setting of user flags as appropriate, but prohibits changing of system flags. Reviewed by: bde
57347	20-Feb-2000	rwatson	Disable chflags() from within jail() so that root within jail can't make a mess in securelevel environments. Results in one warning during /etc/rc as it attempts to remove file flags, but this is harmless. Approved by: High Lord Hubbard
56908	30-Jan-2000	mckusick	When writing out bitmap buffers, need to skip over ones that already have a write in progress. Otherwise one can get in an infinite loop trying to get them all flushed. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
56209	18-Jan-2000	mckusick	During fastpath processing for removal of a short-lived inode, the set of restrictions for cancelling an inode dependency (inodedep) is somewhat stronger than originally coded. Since this check appears in two places, we codify it into the function check_inode_unwritten which we then call from the two sites, one freeing blocks and the other freeing directory entries. Submitted by: Steinar Haug via Matthew Dillon
56208	18-Jan-2000	mckusick	Need to reorganize the flushing of directory entry (pagedep) dependencies so that they never try to lock an inode corresponding to ".." as this can lead to deadlock. We observe that any inode with an updated link count is always pushed into its buffer at the time of the link count change, so we do not need to do a VOP_UPDATE, but merely find its buffer and write it. The only time we need to get the inode itself is from the result of a mkdir whose name will never be ".." and hence locking such an inode will never request a lock above us in the filesystem tree. Thanks to Brian Fundakowski Feldman for providing the test program that tickled soft updates into hanging in "inode" sleep. Submitted by: Brian Fundakowski Feldman <green@FreeBSD.org>
56150	17-Jan-2000	mckusick	Better bounding on softdep_flushfiles; other minor tweeks to checks.
56149	17-Jan-2000	mckusick	Must track multiple uncommitted renames until one ultimately gets committed to disk or is removed.
55947	14-Jan-2000	dillon	Non-operational change, fix compiler warning. Reviewed by: mckusick
55931	13-Jan-2000	mckusick	Confirming Peter's fix (locking 101: release the lock before you go to sleep). Locking 101, part 2: do not look at buffer contents after you have been asleep. There is no telling what wonderous changes may have occurred.
55928	13-Jan-2000	peter	Free the global softupdates lock prior to tsleep() in getdirtybuf(). This seems to be responsible for a bunch of panics where the process sleeps and something else finds softupdates "locked" when it shouldn't be. This commit is unreviewed, but has been a big help here. Previously my boxes would panic pretty much on the first fsync() that wrote something to disk.
55886	13-Jan-2000	mckusick	Because cylinder group blocks are now written in background, it is no longer sufficient to get a lock on a buffer to know that its write has been completed. We have to first get the lock on the buffer, then check to see if it is doing a background write. If it is doing background write, we have to wait for the background write to finish, then check to see if that fullfilled our dependency, and if not to start another write. Luckily the explanation is longer than the fix.
55885	13-Jan-2000	mckusick	A panic occurs during an fsync when a dirty block associated with a vnode has not been written (which would clear certain of its dependencies). The problems arises because fsync with MNT_NOWAIT no longer pushes all the dirty blocks associated with a vnode. It skips those that require rollbacks, since they will just get instantly dirty again. Such skipped blocks are marked so that they will not be skipped a second time (otherwise circular dependencies would never clear). So, we fsync twice to ensure that everything will be written at least once.
55799	11-Jan-2000	mckusick	The only known cause of this panic is running out of disk space. The problem occurs when an indirect block and a data block are being allocated at the same time. For example when the 13th block of the file is written, the filesystem needs to allocate the first indirect block and a data block. If the indirect block allocation succeeds, but the data block allocation fails, the error code dellocates the indirect block as it has nothing at which to point. Unfortunately, it does not deallocate the indirect block's associated dependencies which then fail when they find the block unexpectedly gone (ptr == 0 instead of its expected value). The fix is to fsync the file before doing the block rollback, as the fsync will flush out all of the dependencies. Once the rollback is done the file must be fsync'ed again so that the soft updates code does not find unexpected changes. This approach is much slower than writing the code to back out the extraneous dependencies, but running out of disk space is not expected to be a common occurence, so just getting it right is the main criterion. PR: kern/15063 Submitted by: Assar Westerlund <assar@stacken.kth.se>
55794	11-Jan-2000	mckusick	We cannot proceed to free the blocks of the file until the dependencies have been cleaned up by deallocte_dependencies(). Once that is done, it is safe to post the request to free the blocks. A similar change is also needed for the freefile case.
55756	10-Jan-2000	phk	Give vn_isdisk() a second argument where it can return a suitable errno. Suggested by: bde
55726	10-Jan-2000	mckusick	Missing FREE_LOCK call before handle_workitem_freeblocks. Submitted by: "Kenneth D. Merry" <ken@kdm.org>
55697	10-Jan-2000	mckusick	Several performance improvements for soft updates have been added: 1) Fastpath deletions. When a file is being deleted, check to see if it was so recently created that its inode has not yet been written to disk. If so, the delete can proceed to immediately free the inode. 2) Background writes: No file or block allocations can be done while the bitmap is being written to disk. To avoid these stalls, the bitmap is copied to another buffer which is written thus leaving the original available for futher allocations. 3) Link count tracking. Constantly track the difference in i_effnlink and i_nlink so that inodes that have had no change other than i_effnlink need not be written. 4) Identify buffers with rollback dependencies so that the buffer flushing daemon can choose to skip over them.
55694	09-Jan-2000	mckusick	Keep tighter control of removal dependencies by limiting the number of dirrem structure rather than the collaterally created freeblks and freefile structures. Limit the rate of buffer dirtying by the syncer process during periods of intense file removal.
55692	09-Jan-2000	mckusick	Reorganize softdep_fsync so that it only does the inode-is-flushed check before the inode is unlocked while grabbing its parent directory. Once it is unlocked, other operations may slip in that could make the inode-is-flushed check fail. Allowing other writes to the inode before returning from fsync does not break the semantics of fsync since we have flushed everything that was dirty at the time of the fsync call.
55691	09-Jan-2000	mckusick	Get rid of unreferenced function.
55690	09-Jan-2000	mckusick	Make static non-exported functions from soft updates.
55206	29-Dec-1999	peter	Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL" is an application space macro and the applications are supposed to be free to use it as they please (but cannot). This is consistant with the other BSD's who made this change quite some time ago. More commits to come.
55029	23-Dec-1999	bde	Update the unclean flag for mount -u. I forgot to handle this case when I made the absence of the clean flag sticky in rev.1.88. This was a problem main for "mount /". There is no way to mount "/" for writing without using mount -u (normally implicitly), so after "mount -f /" of an unclean filesystem, the absence of the clean flag was sticky forever.
54952	21-Dec-1999	eivind	Change incorrect NULLs to 0s
54803	19-Dec-1999	rwatson	Second pass commit to introduce new ACL and Extended Attribute system calls, vnops, vfsops, both in /kern, and to individual file systems that require a vfsop_ array entry. Reviewed by: eivind
54700	16-Dec-1999	mckusick	The function request_cleanup() had a tsleep() with PCATCH. It is quite dangerous, since the process may hold locks at the point, and if it is stopped in that tsleep the machine may hang. Because the sleep is so short, the PCATCH is not required here, so it has been removed. For the future, the FreeBSD team needs to decide whether it is still reasonable to stop a process in tsleep, as that may affect any other code that uses PCATCH while holding kernel locks. Submitted by: Dmitrij Tejblum <tejblum@arc.hq.cti.ru> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
54655	15-Dec-1999	eivind	Introduce NDFREE (and remove VOP_ABORTOP)
54444	11-Dec-1999	eivind	Lock reporting and assertion changes. * lockstatus() and VOP_ISLOCKED() gets a new process argument and a new return value: LK_EXCLOTHER, when the lock is held exclusively by another process. * The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them * Extend the vnode_if.src format to allow more exact specification than locked/unlocked. This commit should not do any semantic changes unless you are using DEBUG_VFS_LOCKS. Discussed with: grog, mch, peter, phk Reviewed by: peter
54049	03-Dec-1999	billf	Remove the 'alpha, use at your own risk' death-statement. Reviewed by: mckusick (verbally at FreeBSDcon)
54048	03-Dec-1999	billf	Fix typo, add $FreeBSD$
53996	01-Dec-1999	mckusick	Preferentially allocate the first indirect block in the same cylinder group as the inode. This makes a 15% difference in read speed for files in the 96K to 500K size range.
53722	26-Nov-1999	phk	Retire MFS_ROOT and MFS_ROOT_SIZE options from the MFS implementation. Add MD_ROOT and MD_ROOT_SIZE options to the md driver. Make the md driver handle MFS_ROOT and MFS_ROOT_SIZE options for compatibility. Add md driver to GENERIC, PCCARD and LINT. This is a cleanup which removes the need for some of the worse hacks in MFS: We really want to have a rootvnode but MFS on a preloaded image doesn't really have one. md is a true device, so it is less trouble. This has been tested with make release, and if people remember to add the "md" pseudo-device to their kernels, PicoBSD should be just fine as well. If people have no other use for MFS, it can be removed from the kernel.
53577	22-Nov-1999	phk	Convert various pieces of code to use vn_isdisk() rather than checking for vp->v_type == VBLK. In ccd: we don't need to call VOP_GETATTR to find the type of a vnode. Reviewed by: sos
53464	20-Nov-1999	eivind	We do not have ffs_checkexp, so remove the prototype
53452	20-Nov-1999	phk	struct mountlist and struct mount.mnt_list have no business being a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively. This removes ugly mp != (void*)&mountlist comparisons. Requested by: phk Submitted by: Jake Burkholder jake@checker.org PR: 14967
53360	18-Nov-1999	peter	Fix a warning (unused static declaration without MFS_ROOT)
53131	13-Nov-1999	eivind	Remove WILLRELE from VOP_SYMLINK Note: Previous commit to these files (except coda_vnops and devfs_vnops) that claimed to remove WILLRELE from VOP_RENAME actually removed it from VOP_MKNOD.
53101	12-Nov-1999	eivind	Remove WILLRELE from VOP_RENAME
53059	09-Nov-1999	phk	Next step in the device cleanup process. Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code. Unify spec_open() for bdev and cdev cases. Remove the disabled bdev specific read/write code.
52838	03-Nov-1999	bde	Quick fix for breakage of ext2fs link counts as reported by stat(2) by the soft updates changes: only report the link count to be i_effnlink in ufs_getattr() for file systems that maintain i_effnlink. Tested by: Mike Dracopoulos <mdraco@math.uoa.gr>
52836	03-Nov-1999	msmith	Make MFS work with the new root filesystem search process. In order to achieve this, root filesystem mount is moved from SI_ORDER_FIRST to SI_ORDER_SECOND in the SI_SUB_MOUNT_ROOT sysinit group. Now, modules which wish to usurp the default root mount can use SI_ORDER_FIRST. A compiled-in or preloaded MFS filesystem will become the root filesystem unless the vfs.root.mountfrom environment variable refers to a valid bootable device. This will normally only be the case when the kernel and MFS image have been loaded from a disk which has a valid /etc/fstab file. In this case, the variable should be manually overridden in the loader, or the kernel booted with -a. In either case "mfs:" should be supplied as the new value. Also fix a typo in one DFLTROOT case that would not have compiled.
52782	01-Nov-1999	msmith	Newline-terminate the complaint message about not being able to find the root vnode pointer.
52641	30-Oct-1999	dillon	Add sysctl debug.dircheck to allow directory sanity checking to be turned on with a sysctl. Fix two bugs in ufs_lookup that can cause deadlocks due to out-of-order locking. This fix was tested for a few days prior to commit.
52635	29-Oct-1999	phk	useracc() the prequel: Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ\|WRITE} rather than B_{READ\|WRITE} as argument.
51808	30-Sep-1999	phk	Remove the D_NOCLUSTER[RW] options which were added because vn had problems. Now that Matt has fixed vn, this can go. The vn driver should have used d_maxio (now si_iosize_max) anyway.
51797	29-Sep-1999	phk	Remove v_maxio from struct vnode. Replace it with mnt_iosize_max in struct mount. Nits from: bde
51791	29-Sep-1999	marcel	sigset_t change (part 2 of 5) ----------------------------- The core of the signalling code has been rewritten to operate on the new sigset_t. No methodological changes have been made. Most references to a sigset_t object are through macros (see signalvar.h) to create a level of abstraction and to provide a basis for further improvements. The NSIG constant has not been changed to reflect the maximum number of signals possible. The reason is that it breaks programs (especially shells) which assume that all signals have a non-null name in sys_signame. See src/bin/sh/trap.c for an example. Instead _SIG_MAXSIG has been introduced to hold the maximum signal possible with the new sigset_t. struct sigprop has been moved from signalvar.h to kern_sig.c because a) it is only used there, and b) access must be done though function sigprop(). The latter because the table doesn't holds properties for all signals, but only for the first NSIG signals. signal.h has been reorganized to make reading easier and to add the new and/or modified structures. The "old" structures are moved to signalvar.h to prevent namespace polution. Especially the coda filesystem suffers from the change, because it contained lines like (p->p_sigmask == SIGIO), which is easy to do for integral types, but not for compound types. NOTE: kdump (and port linux_kdump) must be recompiled. Thanks to Garrett Wollman and Daniel Eischen for pressing the importance of changing sigreturn as well.
51658	25-Sep-1999	phk	Remove five now unused fields from struct cdevsw. They should never have been there in the first place. A GENERIC kernel shrinks almost 1k. Add a slightly different safetybelt under nostop for tty drivers. Add some missing FreeBSD tags
51486	20-Sep-1999	dillon	More removals of vnode->v_lastr, replaced by preexisting seqcount heuristic to detect sequential operation. VM-related forced clustering code removed from ufs in preparation for a commit to vm/vm_fault.c that does it more generally. Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
51483	20-Sep-1999	phk	Fix a harmless bug I introduced, simplify a bit more while here.
51479	20-Sep-1999	phk	Step one of replacing devsw->d_maxio with si_bsize_max. Rename dev->si_bsize_max to si_iosize_max and set it in spec_open if the device didn't. Set vp->v_maxio from dev->si_bsize_max in spec_open rather than in ufs_bmap.c
51226	13-Sep-1999	bde	Removed diskerr()'s unused d_name arg and updated callers. This fixes warnings caused by the arg having the wrong type (not const enough). The arg was also wrong (a full name instead of a short one) for calls from from subr_diskmbr.c and pc98/diskslice_machdep.c.
51138	11-Sep-1999	alfred	Seperate the export check in VFS_FHTOVP, exports are now checked via VFS_CHECKEXP. Add fh(open\|stat\|stafs) syscalls to allow userland to query filesystems based on (network) filehandle. Obtained from: NetBSD
51111	09-Sep-1999	julian	Changes to centralise the default blocksize behaviour. More likely to follow. Submitted by: phk@freebsd.org
50830	03-Sep-1999	julian	Revert a bunch of contraversial changes by PHK. After a quick think and discussion among various people some form of some of these changes will probably be recommitted. The reversion requested was requested by dg while discussions proceed. PHK has indicated that he can live with this, and it has been agreed that some form of some of these changes may return shortly after further discussion.
50623	30-Aug-1999	phk	Make bdev userland access work like cdev userland access unless the highly non-recommended option ALLOW_BDEV_ACCESS is used. (bdev access is evil because you don't get write errors reported.) Kill si_bsize_best before it kills Matt :-) Use the specfs routines rather having cloned copies in devfs.
50521	28-Aug-1999	phk	remove unused variables.
50511	28-Aug-1999	phk	We don't need to pass the diskname argument all over the diskslice/label code, we can find the name from any convenient dev_t
50480	28-Aug-1999	peter	$Id$ -> $FreeBSD$
50477	28-Aug-1999	peter	$Id$ -> $FreeBSD$
50405	26-Aug-1999	phk	Simplify the handling of VCHR and VBLK vnodes using the new dev_t: Make the alias list a SLIST. Drop the "fast recycling" optimization of vnodes (including the returning of a prexisting but stale vnode from checkalias). It doesn't buy us anything now that we don't hardlimit vnodes anymore. Rename checkalias2() and checkalias() to addalias() and addaliasu() - which takes dev_t and udev_t arg respectively. Make the revoke syscalls use vcount() instead of VALIASED. Remove VALIASED flag, we don't need it now and it is faster to traverse the much shorter lists than to maintain the flag. vfs_mountedon() can check the dev_t directly, all the vnodes point to the same one. Print the devicename in specfs/vprint(). Remove a couple of stale LFS vnode flags. Remove unimplemented/unused LK_DRAINED;
50347	25-Aug-1999	phk	Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.
50316	24-Aug-1999	phk	Initialize the si_bsize fields for the MFS bogodevices. (This broke MFS rootfs and thereby installation)
50305	24-Aug-1999	sheldonh	Fix bug introduced in rev 1.28, which causes kernel build to break for the case where DEBUG is defined but not DIAGNOSTIC. ffs_checkblk is declared conditionally on DIAGNOSTIC, not DEBUG. PR: 13314 Reviewed by: bde
50253	23-Aug-1999	bde	Use devtoname() to print dev_t's instead of casting them to long or u_long for misprinting in %lx format.
50137	22-Aug-1999	jdp	Support full-precision file timestamps. Until now, only the seconds have been maintained, and that is still the default. A new sysctl variable "vfs.timestamp_precision" can be used to enable higher levels of precision: 0 = seconds only; nanoseconds zeroed (default). 1 = seconds and nanoseconds, accurate within 1/HZ. 2 = seconds and nanoseconds, truncated to microseconds. >=3 = seconds and nanoseconds, maximum precision. Level 1 uses getnanotime(), which is fast but can be wrong by up to 1/HZ. Level 2 uses microtime(). It might be desirable for consistency with utimes() and friends, which take timeval structures rather than timespecs. Level 3 uses nanotime() for the higest precision. I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with "cpio -pdu". There was almost negligible difference in the system times -- much less than 1%, and less than the variation among multiple runs at the same level. Bruce Evans dreamed up a torture test involving 1-byte reads with intervening fstat() calls, but the cpio test seems more realistic to me. This feature is currently implemented only for the UFS (FFS and MFS) filesystems. But I think it should be easy to support it in the others as well. An earlier version of this was reviewed by Bruce. He's not to blame for any breakage I've introduced since then. Reviewed by: bde (an earlier version of the code)
49945	17-Aug-1999	alc	Add the (inline) function vm_page_undirty for clearing the dirty bitmask of a vm_page. Use it. Submitted by: dillon
49771	14-Aug-1999	phk	Spring cleaning around strategy and disklabels/slices: Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout. please see comment in sys/conf.h about the flag argument. Remove strategy argument from all the diskslice/label/bad144 implementations, it should be found from the dev_t. Remove bogus and unused strategy1 routines. Remove open/close arguments from dssize(). Pick them up from dev_t. Remove unused and unfinished setgeom support from diskslice/label/bad144 code.
49682	13-Aug-1999	phk	Move the special-casing of stat(2)->st_blksize for device files from UFS to the generic level. For chr/blk devices we don't care about the blocksize of the filesystem, we want what the device asked for.
49679	13-Aug-1999	phk	The bdevsw() and cdevsw() are now identical, so kill the former.
49678	13-Aug-1999	phk	s/v_specinfo/v_rdev/
49535	08-Aug-1999	phk	Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>, a few lines into <sys/vnode.h>. Add a few fields to struct specinfo, paving the way for the fun part.
49338	01-Aug-1999	alc	Move the memory access behavior information provided by madvise from the vm_object to the vm_map. Submitted by: dillon
49073	25-Jul-1999	bde	Fixed access timestamp bugs: Set IN_ACCESS for successful reads of 0 bytes (except for requests to read 0 bytes). This was broken in rev.1.42. PR: misc/10148 Don't set IN_ACCESS for requests to read 0 bytes. Don't set IN_ACCESS for unsuccessful reads.
48936	20-Jul-1999	phk	Now a dev_t is a pointer to struct specinfo which is shared by all specdev vnodes referencing this device. Details: cdevsw->d_parms has been removed, the specinfo is available now (== dev_t) and the driver should modify it directly when applicable, and the only driver doing so, does so: vn.c. I am not sure the logic in checking for "<" was right before, and it looks even less so now. An intial pool of 50 struct specinfo are depleted during early boot, after that malloc had better work. It is likely that fewer than 50 would do. Hashing is done from udev_t to dev_t with a prime number remainder hash, experiments show no better hash available for decent cost (MD5 is only marginally better) The prime number used should not be close to a power of two, we use 83 for now. Add new checkalias2() to get around the loss of info from dev2udev() in bdevvp(); The aliased vnodes are hung on a list straight of the dev_t, and speclisth[SPECSZ] is unused. The sharing of struct specinfo means that the v_specnext moves into the vnode which grows by 4 bytes. Don't use a VBLK dev_t which doesn't make sense in MFS, now we hang a dummy cdevsw on B/Cmaj 253 so that things look sane. Storage overhead from all of this is O(50k). Bump __FreeBSD_version to 400009 The next step will add the stuff needed so device-drivers can start to hang things from struct specinfo
48859	17-Jul-1999	phk	I have not one single time remembered the name of this function correctly so obviously I gave it the wrong name. s/umakedev/makeudev/g
48801	13-Jul-1999	mckusick	Create the macro DOINGASYNC to check whether the MNT_ASYNC flag has been set for a mount point. Insert missing checks to ensure that all write operations are done asynchronously when the MNT_ASYNC option has been requested. Submitted by: Craig A Soules <soules+@andrew.cmu.edu> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
48759	11-Jul-1999	phk	Use the fsid from the superblock, unless it looks bogus or has already been taken by some other filesystem.
48677	08-Jul-1999	mckusick	These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bring the machine to a grinding halt. * buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems. * minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propogate to deeper inlines and should produce better code. * removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely. * removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal. * buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c * VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system. * removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers. * slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers. * addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation. * Removal of the newbufcnt and lastnewbuf counters that Kirk added. They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ). * removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ). * removal of extra arguments passed to getnewbuf() that are not necessary. * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c * vn_write() now calls bwillwrite() PRIOR to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently. * removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon. Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
48656	07-Jul-1999	roberto	Add $Id$ Approved by: kirk
48536	03-Jul-1999	jdp	Update pathnames for new location of soft-updates sources.
48334	29-Jun-1999	mckusick	No longer need to set B_ASYNC flag since BUF_KERNPROC now unconditionally sets the identity of the buffer.
48276	27-Jun-1999	peter	Keep the inlines for <sys/buf.h> happy..
48225	26-Jun-1999	mckusick	Convert buffer locking from using the B_BUSY and B_WANTED flags to using lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
47995	18-Jun-1999	mckusick	On our final pass through ffs_fsync, do all I/O synchronously so that we can find out if our flush is failing because of write errors. This change avoids a "flush failed" panic during unrecoverable disk errors.
47964	16-Jun-1999	mckusick	Add a vnode argument to VOP_BWRITE to get rid of the last vnode operator special case. Delete special case code from vnode_if.sh, vnode_if.src, umap_vnops.c, and null_vnops.c.
47940	15-Jun-1999	mckusick	Get rid of the global variable rushjob and replace it with a function in kern/vfs_subr.c named speedup_syncer() which handles the speedup request. Change the various clients of rushjob to use the new function.
47640	31-May-1999	phk	Simplify cdevsw registration. The cdevsw_add() function now finds the major number(s) in the struct cdevsw passed to it. cdevsw_add_generic() is no longer needed, cdevsw_add() does the same thing. cdevsw_add() will print an message if the d_maj field looks bogus. Remove nblkdev and nchrdev variables. Most places they were used bogusly. Instead check a dev_t for validity by seeing if devsw() or bdevsw() returns NULL. Move bdevsw() and devsw() functions to kern/kern_conf.c Bump __FreeBSD_version to 400006 This commit removes: 72 bogus makedev() calls 26 bogus SYSINIT functions if_xe.c bogusly accessed cdevsw[], author/maintainer please fix. I4b and vinum not changed. Patches emailed to authors. LINT probably broken until they catch up.
47443	24-May-1999	jb	- Back out Luoqi's cdevsw stuff. It panics on my system and is not required. - Fix an error message. - Do the MFS_ROOT setting of mountrootfsname in mfs_init() instead of cpu_rootconf(). - Set rootdev in mfs_init instead of later in mfs_mount() iff MFS_ROOT.
47381	22-May-1999	julian	Cosmetic changes to make it compile without errors in gcc -Wall
47202	14-May-1999	luoqi	Legally acquire a major number for mfs.
47131	14-May-1999	mckusick	Add a hook to ffs_fsync to allow soft updates to get first chance at doing a sync on the block device for the filesystem. That allows it to push the bitmap blocks before the inode blocks which greatly reduces the number of inode rollbacks that need to be done.
47085	12-May-1999	peter	Try and fix a dev_t/major/minor etc nit.
47028	11-May-1999	phk	Divorce "dev_t" from the "major\|minor" bitmap, which is now called udev_t in the kernel but still called dev_t in userland. Provide functions to manipulate both types: major() umajor() minor() uminor() makedev() umakedev() dev2udev() udev2dev() For now they're functions, they will become in-line functions after one of the next two steps in this process. Return major/minor/makedev to macro-hood for userland. Register a name in cdevsw[] for the "filedescriptor" driver. In the kernel the udev_t appears in places where we have the major/minor number combination, (ie: a potential device: we may not have the driver nor the device), like in inodes, vattr, cdevsw registration and so on, whereas the dev_t appears where we carry around a reference to a actual device. In the future the cdevsw and the aliased-from vnode will be hung directly from the dev_t, along with up to two softc pointers for the device driver and a few houskeeping bits. This will essentially replace the current "alias" check code (same buck, bigger bang). A little stunt has been provided to try to catch places where the wrong type is being used (dev_t vs udev_t), if you see something not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if it makes a difference. If it does, please try to track it down (many hands make light work) or at least try to reproduce it as simply as possible, and describe how to do that. Without DEVT_FASCIST I belive this patch is a no-op. Stylistic/posixoid comments about the userland view of the <sys/*.h> files welcome now, from userland they now contain the end result. Next planned step: make all dev_t's refer to the same devsw[] which means convert BLK's to CHR's at the perimeter of the vnodes and other places where they enter the game (bootdev, mknod, sysctl).
46956	11-May-1999	bde	Fixed disordering in previous 2 commits.
46915	10-May-1999	peter	Move the mfs_getimage() prototype to mfs_extern.h duplicating it everywhere.
46827	09-May-1999	mckusick	Put back changes that might be causing trouble on Alpha.
46676	08-May-1999	phk	I got tired of seeing all the cdevsw[major(foo)] all over the place. Made a new (inline) function devsw(dev_t dev) and substituted it. Changed to the BDEV variant to this format as well: bdevsw(dev_t dev) DEVFS will eventually benefit from this change too.
46635	07-May-1999	phk	Continue where Julian left off in July 1998: Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline) function. Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention to the order of the cmaj/bmaj arguments!) Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE (ditto!) (Next step will be to convert all bdev dev_t's to cdev dev_t's before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
46618	07-May-1999	mckusick	Whitespace cleanup.
46616	07-May-1999	mckusick	Get rid of random debugging cruft; sync up with latest version.
46609	07-May-1999	mckusick	Severe slowdowns have been reported when creating or removing many files at once on a filesystem running soft updates. The root of the problem is that soft updates limits the amount of memory that may be allocated to dependency structures so as to avoid hogging kernel memory. The original algorithm just waited for the disk I/O to catch up and reduce the number of dependencies. This new code takes a much more aggressive approach. Basically there are two resources that routinely hit the limit. Inode dependencies during periods with a high file creation rate and file and block removal dependencies during periods with a high file removal rate. I have attacked these problems from two fronts. When the inode dependency limits are reached, I pick a random inode dependency, UFS_UPDATE it together with all the other dirty inodes contained within its disk block and then write that disk block. This trick usually clears 5-50 inode dependencies in a single disk I/O. For block and file removal dependencies, I pick a random directory page that has at least one remove pending and VOP_FSYNC its directory. That releases all its removal dependencies to the work queue. To further hasten things along, I also immediately start the work queue process rather than waiting for its next one second scheduled run.
46568	06-May-1999	peter	Add sufficient braces to keep egcs happy about potentially ambiguous if/else nesting.
46349	02-May-1999	alc	The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
46155	28-Apr-1999	phk	This Implements the mumbled about "Jail" feature. This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do. For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers". Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname. Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors. It generally does what one would expect, but setting up a jail still takes a little knowledge. A few notes: I have no scripts for setting up a jail, don't ask me for them. The IP number should be an alias on one of the interfaces. mount a /proc in each jail, it will make ps more useable. /proc/<pid>/status tells the hostname of the prison for jailed processes. Quotas are only sensible if you have a mountpoint per prison. There are no privisions for stopping resource-hogging. Some "#ifdef INET" and similar may be missing (send patches!) If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome! Tools, comments, patches & documentation most welcome. Have fun... Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/
46124	27-Apr-1999	msmith	Simplify the tunefs example, since tunefs uses getfsfile(). Lots of people complain about working out what device their filesystems are mounted on.
46112	27-Apr-1999	phk	Suser() simplification: 1: s/suser/suser_xxx/ 2: Add new function: suser(struct proc ), prototyped in <sys/proc.h>. 3: s/suser_xxx($[a-zA-Z0-9_]$->p_ucred, \&\1->p_acflag)/suser(\1)/ The remaining suser_xxx() calls will be scrutinized and dealt with later. There may be some unneeded #include <sys/cred.h>, but they are left as an exercise for Bruce. More changes to the suser() API will come along with the "jail" code.
45911	21-Apr-1999	dt	Change type of a variable from u_int to size_t, so that pointer to it may be used as a last argument to copyinstr().
45570	11-Apr-1999	eivind	Correct typo in panic message
45362	06-Apr-1999	peter	Hold the mfs process's upages in-core with PHOLD rather than P_NOSWAP.
45347	05-Apr-1999	julian	Catch a case spotted by Tor where files mmapped could leave garbage in the unallocated parts of the last page when the file ended on a frag but not a page boundary. Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF, in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c ufs/ufs/ufs_readwrite.c kern/vfs_bio.c Submitted by: Matt Dillon <dillon@freebsd.org> Reviewed by: Alan Cox <alc@freebsd.org>
45332	05-Apr-1999	peter	There's not much point in the EXPORTMFS #ifdef. I've had this sitting in my tree for 12+ months, and I just noticed that NetBSD have (I think, I've just seen the commit, not the change) just zapped it there. It wasn't in the options files or LINT either.
44675	12-Mar-1999	julian	Stop the mfs from trying to swap out crucial bits of the mfs as this can lead to deadlock. Submitted by: Mat dillon <dillon@freebsd.org>
44512	06-Mar-1999	bde	Don't depend on <ufs/ufs/quota.h> or another (old) prerequisite including <sys/queue.h>. This fixes my recent breakage of biosboot by unpolluting <ufs/ufs/quota.h> in the !KERNEL case.
44480	05-Mar-1999	bde	Moved kernel declarations inside the KERNEL ifdef, and removed include of <sys/queue.h> in the !KERNEL case. The prerequisites for <ufs/ufs/quota.h> were broken in Lite2 by converting some of the kernel declarations to use queue macros without including <sys/queue.h>. <sys/queue.h> was included in applications in /usr/src instead. We polluted this file instead of merging the changes in the applications. Include <sys/queue.h> in the KERNEL case, and forward-declare all structs that are used in prototypes, so that this file is almost self-sufficient even in the kernel. Obtained from: mostly from NetBSD
44474	05-Mar-1999	bde	Changed the type of quotactl()'s 4th arg from `char ' to `void ' so that non-sloppy applications can call it without using disgusting casts to avoid warnings. The 4th arg is sort of varargs -- it must sometimes represent a filename, sometimes a struct pointer, and is sometimes unused. The arg type is still caddr_t in the kernel. Obtained from: mostly from NetBSD
44398	02-Mar-1999	mckusick	Reorganize locking to avoid holding the lock during calls to bdwrite and brelse (which may sleep in some systems). Obtained from: Matthew Dillon <dillon@apollo.backplane.com>
44395	02-Mar-1999	imp	Merge patch to ufs_vnops.c's ufs_rename to the copy of ufs_rename that lives in ext2_vnops.c for ext2fs. Also remove cast from comparision. Bruce pointed out that it was bogus since we'd force a signed comparision when we really wanted an unsigned comparison.
44391	02-Mar-1999	mckusick	When fsync'ing a file on a filesystem using soft updates, we first try to write all the dirty blocks. If some of those blocks have dependencies, they will be remarked dirty when the I/O completes. On systems with really fast I/O systems, it is possible to get in an infinite loop trying to flush the buffers, because the I/O finishes before we can get all the dirty buffers off the v_dirtyblkhd list and into the I/O queue. (The previous algorithm looped over the v_dirtyblkhd list writing out buffers until the list emptied.) So, now we mark each buffer that we try to write so that we can distinguish the ones that are being remarked dirty from those that we have not yet tried to flush. Once we have tried to push every buffer once, we then push any associated metadata that is causing the remaining buffers to be redirtied. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
44383	02-Mar-1999	mckusick	Ensure that softdep_sync_metadata can handle bmsafemap and mkdir entries if they ever arise (which should not happen as softdep_sync_metadata is currently used).
44291	26-Feb-1999	imp	Fix last commit based on feedback from Guido, Bruce and Terry. Specifically, the test was in the wrong place, lacked a cast, didn't unlock the node, and exited to bad rather than abortit. Now we don't allow renaming of a file with LINK_MAX references. Move the test to earlier in the code as it is closer to where ip is obtained, as that is the style of the rest of the function. Didn't fix the problems bruce pointed out in the rename man page to include EMLINK, nor address his complaints about how the whole idea of incrementing the link count during a rename is potentially asking for trouble. Also didn't try to correct potential problem Terry pointed out with decrements not being similarly protected against underflow.
44253	25-Feb-1999	imp	Add missing check for LINK_MAX in ufs_rename. Since ip->i_effnlink and ip->nlink were different types, there was a masked overflow. Reported by: Mark Slemco <marcs@znep.com>
44248	25-Feb-1999	dillon	Update ufs_vnops code to use new specinfo fields rather then guess. This is part of general specinfo / d_parms() commit.
44102	17-Feb-1999	mckusick	fix double LIST_REMOVE; other cosmetic changes to match version 9.32. Obtained from: Jeffrey Hsu <hsu@FreeBSD.ORG>
43958	13-Feb-1999	dillon	Remove XXX comment in regarsd to why NFS doesn't use VOP_ABORT(). NFS is being fixed now.
43311	28-Jan-1999	dillon	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
43287	27-Jan-1999	dillon	Remove unintended trigraph sequences in comments for -Wall
43044	22-Jan-1999	dg	Gutted softdep_deallocate_dependencies and replaced it with a panic. It turns out to not be useful to unwind the dependencies and continue in the face of a fatal error. Also changed the log() to a printf() in softdep_error() so that it will be output in the case of a impending panic. Submitted by: Kirk McKusick <mckusick@mckusick.com>
42965	21-Jan-1999	dillon	Added support for VOP_FREEBLKS(), reducing MFS's impact on swap and increasing performance by deallocating at least some of the backing store when files are removed. Protect mfsp->buf_queue access at splbio().
42964	21-Jan-1999	dillon	Access to mfsp->buf_queue must be protected at splbio(). Other minor adjustments also made, such as passing mfsp to mfs_doio() directly.
42957	21-Jan-1999	dillon	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
42567	12-Jan-1999	eivind	Silence warning about unused debug function. (I'll turn this function into a DDB command in my next staticization sweep).
42400	08-Jan-1999	eivind	Add a warning about the copyright restraints.
42374	07-Jan-1999	bde	Don't pass unused unused timestamp args to UFS_UPDATE() or waste time initializing them. This almost finishes centralizing (in-core) timestamp updates in ufs_itimes().
42354	06-Jan-1999	bde	UFS_UPDATE() takes a boolean `waitfor' arg, so don't pass it the value MNT_WAIT when we mean boolean `true' or check for that value not being passed. There was no problem in practice because MNT_WAIT had the magic value of 1.
42351	06-Jan-1999	bde	Ifdefed the conditionally used variable `prtrealloc'. Declare it as volatile so that there is no chance that the code that it controls is optimised away.
42350	06-Jan-1999	bde	Backed out rev.1.47. It just broke my optimisations for lazy syncing of timestamps in rev.1.45. The soft updates bug was elsewhere. Forgotten by: luoqi
42315	05-Jan-1999	eivind	Remove the 'waslocked' parameter to vfs_object_create().
42248	02-Jan-1999	bde	Ifdefed conditionally used simplock variables.
42244	02-Jan-1999	eivind	Remove the last clients of vfs_object_create(..., waslocked=1); waslocked will go away shortly. Reviewed by: dg
42216	01-Jan-1999	dillon	The mount_mfs process that stays in a supervisor context handling MFS I/O requests must be marked P_SYSTEM because if it isn't and the system decides to swap it or (god forbid) kill it, the system stands a good chance of locking up.
42042	24-Dec-1998	bde	Fixed null pointer panics which I introduced in rev.1.86. Vnodes may be revoked, so vnop routines must be careful about accessing the vnode if they may have blocked. Fixed marking for update after successfully reading or writing 0 bytes. In this case, POSIX.1 specifies marking if and only if the requested count is nonzero, but rev.1.86 never marked.
41961	20-Dec-1998	bde	Remove unused file. It seems to have been a vestige of when mfs did its own memory allocation.
41954	20-Dec-1998	dfr	In ufs_setattr(), if only one of va_atime or va_mtime are != VNOVAL, then the code set the other field in the inode to VNOVAL. This can happen sometimes on an NFS server.
41809	15-Dec-1998	julian	Add comments to code that I was trying to understand. Hopefully will save others time. Someone who understands this better might check for correctness.
41765	14-Dec-1998	dillon	Fix -Wuninitialized warning regarding zero-length var-args ctl element. ( this isn't really an error, but I think it is important to fix the warning ).
41659	10-Dec-1998	julian	Remove some compiler warnings.
41610	09-Dec-1998	eivind	Make compare correct with unsigned types. (Problem introduced by Lite/2).
41591	07-Dec-1998	archie	The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static and local variables, goto labels, and functions declared but not defined.
41395	29-Nov-1998	bde	Don't use the strange null pointer constant `(ufs_daddr_t)0' in a call to VOP_BMAP(). Don't use uncast NULLs in the same call.
41124	13-Nov-1998	dg	Restored the "reallocblks" code to its former glory. What this does is basically do a on-the-fly defragmentation of the FFS filesystem, changing file block allocations to make them contiguous. Thanks to Kirk McKusick for providing hints on what needed to be done to get this working.
41059	10-Nov-1998	peter	add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE()
40791	31-Oct-1998	peter	Change dirty block list handling to use TAILQ macros.
40790	31-Oct-1998	peter	Use TAILQ macros for clean/dirty block list processing. Set b_xflags rather than abusing the list next pointer with a magic number.
40692	28-Oct-1998	jkh	Clarify a rather ambiguous debugging message.
40672	27-Oct-1998	bde	Oops, the redundant tests for major numbers weren't redundant here. They checked for the magic major number for the "device" behind mfs mount points. Use a more obvious check for this device. Debugged by: Andrew Gallatin <gallatin@cs.duke.edu>
40660	26-Oct-1998	bde	Removed redundant bitrotted checks for major numbers instead of updating them.
40649	25-Oct-1998	bde	Don't follow null bdevsw pointers. The `major(dev) < nblkdev' test rotted when bdevsw[] became sparse. We still depend on magic to avoid having to check that (v_rdev) device numbers in vnodes are not NODEV. Removed redundant `major(dev) < nblkdev' tests instead of updating them.
40648	25-Oct-1998	phk	Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
40469	17-Oct-1998	bde	Use only the correct raw partition for writing labels. Don't use the partition that the label ioctl is being done on just because it has offset 0, since there is no guarantee that such a partition is large enough to contain the label. Don't use the wrong raw partition (0 instead of RAW_PART). This fixes problems rewriting bizarre labels (with a nonzero offset for the 'a' partition) in newfs(8). Such labels shouldn't normally be used, but creating them was allowed if the ioctl was done on the raw partition, and sysinstall creates them if the root partition isn't allocated first. Note that allowing write access to a partition other than the one that has been checked for write access doesn't increase security holes significantly, since write access to any partition already allows changing the in-core label. This fix should be in 3.0R. Rev.1.26 of newfs/newfs.c shouldn't be in 3.0R.
40448	16-Oct-1998	jkh	fixup for alpha.
40304	13-Oct-1998	bde	Fixed bloatage of `struct inode'. We used 5 "spare" fields for ext2fs, but when i_effnlink was added to support soft updates, there was only room for 4 spares. The number of spares was not reduced, so the inode size became 260 (on i386's), or 512 after rounding up by malloc(). Use one spare field in `struct dinode' instead of the 5th spare field in the inode and reduced to 4 spares in the inode so that the size is 256 again. Changed the types of the spares in the inode from int to u_int32_t so that the inode size has more chance of being <= 256 under other arches, and downdated ext2fs to match (it was broken to use ints before rev.1.1).
40251	12-Oct-1998	peter	"fix" a warning
40164	10-Oct-1998	jkh	Allow more flexible use of MFS root. Submitted by: peter
40153	09-Oct-1998	peter	MODINFO_ADDR has real addresses now, remove the manual relocation based on cpu type.
40099	09-Oct-1998	jkh	Add some evil temporary phys-to-kern translation for mfs.
40094	09-Oct-1998	jkh	include proper header for Mike's new stuff.
40084	08-Oct-1998	jkh	Allow the module area to be used in order to find the MFS image (in addition to allowing it to be compiled in) and stop overloading the MFS_ROOT variable to store size information.
40038	07-Oct-1998	luoqi	Use vm_page_xxx() inline functions to manipulate vm_page::flags, vm_page::busy. As a side effect, a few wakeup() calls are added, which might fix some of the missing vm_page wakeups people have been seeing. Reviewed by: Doug Rabson <dfr@nlsystems.com>
39933	03-Oct-1998	nate	Fix 'noatime' bug that was unrelated to use of noatime. The problem is caused when a directory block is compacted. When this occurs, softdep_change_directoryentry_offset() is called to relocate each directory entry and adjust its matching diradd structure, if any, to match the new location of the entry. The bug is that while softdep_change_directoryentry_offset() correctly adjusts the offsets of the diradd structures on the pd_diraddhd[] lists (which are not yet ready to be committed to disk), it fails to adjust the offsets of the diradd structures on the pd_pendinghd list (which are ready to be committed to disk). This causes the dependency structures to be inconsistent with the buf contents. Now, if the compaction has moved a directory entry to the same offset as one of the diradd structures on the pd_pendinghd list and a syscall is done that tries to remove this directory entry before this directory block has been written to disk (which would empty pd_pendinghd), a sanity check in newdirrem() will call panic() when it notices that the inode number in the entry that it is to be removed doesn't match the inode number in the diradd structure with that offset of that entry. Reviewed by: Kirk McKusick <mckusick@McKusick.COM> Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>
39796	30-Sep-1998	mckusick	Do not allow a mounted on directory to be rmdir'ed. This removal can happen when an NFS exported filesystem tries to remove a locally mounted on directory. PR: kern/7272 Submitted by: Andre Albsmeier <andre.albsmeier@mchp.siemens.de>
39669	26-Sep-1998	bde	Fixed clean flag handling: - don't set the clean flag on unmount of an unclean filesystem that was (forcibly) mounted rw. - set the clean flag on rw -> ro update of a mounted initially-clean filesystem. - fixed some style bugs (mostly long lines). This uses the fs_flags field and FS_UNCLEAN state bit which were introduced in the softdep changes. NetBSD uses extra state bits in fs_clean. Reviewed by: luoqui
39623	24-Sep-1998	luoqi	Eliminate a race in VOP_FSYNC() when softupdates is enabled. Submitted by: Kirk McKusick <mckusick@McKusick.COM> Two minor changes are also included, 1. Remove gratuitious checks for error return from vn_lock with LK_RETRY set, vn_lock should always succeed in these cases. 2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount a little more unstable. It also keeps us in sync with other BSDs. Suggested by: Bruce Evans <bde@zeta.org.au>
39281	15-Sep-1998	luoqi	Restore pre-v1.44 behavior: always copy modified in-core inode to disk buffer. Otherwise some in-core inode changes might be lost, including important meta data (e.g. size) if softupdates is enabled.
39238	15-Sep-1998	gibbs	When a buffer is removed from a buffer queue, remember it's block number and use it as "the currently active" buffer in doing disk sort calculations.
39187	14-Sep-1998	sos	Remove the SLICE code. This clearly needs alot more thought, and we dont need this to hunt us down in 3.0-RELEASE.
39099	12-Sep-1998	bde	Don't dereference an uninitialized pointer in dead code. The dead code gets executed if it is compiled without optimization.
38909	07-Sep-1998	bde	Removed statically configured mount type numbers (MOUNT_) and all references to them. The change a couple of days ago to ignore these numbers in statically configured vfsconf structs was slightly premature because the cd9660, cfs, devfs, ext2fs, nfs vfs's still used MOUNT_ instead of the number in their vfsconf struct.
38907	07-Sep-1998	bde	Put the zombie ffs sysctl node in "notyet" state together with its few remaining children. Prepare it for MOUNT_UFS going away.
38902	07-Sep-1998	phk	Make MFS do the default on VOP_FREEBLKS(). XXX: we could deallocate the storage, but somebody else will have to pick up that task.
38862	05-Sep-1998	phk	Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform device drivers about sectors no longer in use. Device-drivers receive the call through d_strategy, if they have D_CANFREE in d_flags. This allows flash based devices to erase the sectors and avoid pointlessly carrying them around in compactions. Reviewed by: Kirk Mckusick, bde Sponsored by: M-Systems (www.m-sys.com)
38418	18-Aug-1998	bde	Quick fix for breakage of read clustering on non-IDE drives. Read clustering is obsolescent technology so hardly anyone noticed. On a DORS 32160 SCSI drive with 4 tags, read clustering makes very little difference even for huge sequential reads. However, on a ZIP SCSI drive with 0 tags, the minimum overhead per block is about 40 msec, so very large clusters must be used to get anywhere near the maximum transfer rate. Using clusters consisting of 1 8K block reduces the transfer rate to about 250K/sec. Under msdosfs, missing read clustering is normal and a cluster size of 1 512 byte block reduces the transfer rate to about 25K/sec. Broken in: rev.1.18
38408	17-Aug-1998	bde	Removed unused includes.
38292	12-Aug-1998	msmith	"The releaseing of the reference and lock is not temporary and belongs where it is. The reference and lock(s) are acquired just above the code in VREF() and relookup()." Submitted by: Michael Hancock <michaelh@cet.co.jp>
38291	12-Aug-1998	julian	Handle the case of moving a directory onto the top of a sibling's child of the same name. Submitted by: Kirk Mckusick with fixes from luoqi Chen Obtained from: Whistle test tree.
37922	28-Jul-1998	bde	Used daddr_t's, not ints, to store disk block numbers. Updated printf formats and args to match. Fixed old printf format errors (all related; most were hidden by calling printf indirectly). This change somehow avoids compiler bugs for 64-bit longs on i386's, although it increases the number of 64-bit calculations.
37887	27-Jul-1998	bde	Made lazy syncing of timestamps for special files non-optional.
37649	15-Jul-1998	bde	Cast pointers to uintptr_t/intptr_t instead of to u_long/long, respectively. Most of the longs should probably have been u_longs, but this changes is just to prevent warnings about casts between pointers and integers of different sizes, not to fix poorly chosen types.
37555	11-Jul-1998	bde	Fixed printf format errors.
37539	10-Jul-1998	julian	Add code missed in the initial Soft updates integration. Make the unallocated parts of a directry have a know state in case we need it later.
37520	08-Jul-1998	julian	Don't update superblock if mounted readonly, also fixes some problems with softupdates on root. More cleanups are needed here.. Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
37490	08-Jul-1998	julian	Catch a few corner cases where FreeBSD differs enough from BSD 4.4 to confuse Soft updates.. Should solve several "dangling deps" panics.
37384	04-Jul-1998	julian	VOP_STRATEGY grows an (struct vnode *) argument as the value in b_vp is often not really what you want. (and needs to be frobbed). more cleanups will follow this. Reviewed by: Bruce Evans <bde@freebsd.org>
37364	03-Jul-1998	bde	Restored revs.1.89-1.90 which I somehow clobbered in rev.1.91.
37363	03-Jul-1998	bde	Sync timestamp changes for inodes of special files to disk as late as possible (when the inode is reclaimed). Temporarily only do this if option UFS_LAZYMOD configured and softupdates aren't enabled. UFS_LAZYMOD is intentionally left out of /sys/conf/options. This is mainly to avoid almost useless disk i/o on battery powered machines. It's silly to write to disk (on the next sync or when the inode becomes inactive) just because someone hit a key or something wrote to the screen or /dev/null. PR: 5577 Previous version reviewed by: phk
37362	03-Jul-1998	bde	Centralized in-core inode update. Update the in-core inode directly in ufs_setattr() so that there is no need to pass timestamps to UFS_UPDATE() (everything else just needs the current time). Ignore the passed-in timestamps in UFS_UPDATE() and always call ufs_itimes() (was: itimes()) to do the update. The timestamps are still passed so that all the callers don't need to be changed yet.
37182	27-Jun-1998	phk	Make vprint() print dev_t in hex also.
37181	27-Jun-1998	phk	Report the type from the inode, not the vnode.
37167	26-Jun-1998	jkh	Flesh this document out just a little in response to some user questions and also recommend linking over copying since, at this stage, a stale copy is a real concern.
37094	21-Jun-1998	bde	Removed unused includes.
36990	14-Jun-1998	julian	Slight change to directory cleanup Makes soft updates a bit cleaner. Eliminates some warnings about 'corrupted directories' from fsck.
36936	12-Jun-1998	julian	Note which version of Kirk's sources this corresponds to.
36935	12-Jun-1998	julian	Fix the case when renaming to a file that you've just created and deleted, that had an inode that has not yet been written to disk, when the inode of the new file is also not yet written to disk, and your old directory entry is not yet on disk but you need to remove it and the new name exists in memory but has been deleted but the transaction to write the deleted name to disk exists and has not yet been cancelled by the request to delete the non existant name. I don't know how kirk could have missed such a glaring problem for so long. :-) Especially since the inconsitency survived on the disk for a whole 4 second on average before being fixed by other code. This was not a crashing bug but just led to filesystem inconsitencies if you crashed. Submitted by: Kirk McKusick (mckusick@mckusick.com)
36900	11-Jun-1998	julian	Add B_NOCACHE to several cases where BSD4.4 only required a B_INVAL. Change worked out by john and kirk in consort.
36871	10-Jun-1998	julian	Fix for "live inode" panic. Submitted by: Kirk McKusick <mckusick@McKusick.COM> Reviewed by: yeah right...
36866	10-Jun-1998	julian	Remove buggy debugging code.
36863	10-Jun-1998	julian	Back out John's changes 1.45 -> 1.46 Kirk confirms that the original semantic was what he wanted... (well, a very slight difference) May fix "dangling deps" panic with soft updates.
36779	08-Jun-1998	julian	The version of the softdep changes in FreeBSD broke the (doingdirectory && !newparent) case of ufs_rename(). rename("D1/X/", "D2/Y/") gives a wrong link count for D2. Submitted by: Bruce Evans <bde@zeta.org.au> Reviewed by: Kirk McKusick <mckusick@McKusick.COM>
36723	07-Jun-1998	bde	Null change. Forgot to mention in previous log message that MNT_NOATIME is now ignored for special files, so that mounting root with option noatime doesn't break reporting of idle times in programs like `w'. The problem of execessive disk updates just to stamp atimes will be handled for special files by only writing atimes to disk when inodes become active. This works well because special files are relatively uncommon and their atimes are even more disposable at panic time than regular files' atimes.
36721	07-Jun-1998	bde	Fixed some longstanding timestamp bugs: 1. mark atimes and mtimes of special files and fifos for update upon successful completion of non-null i/o, not at the beginning of the syscall. 2. never update file times for readonly filesystems. They were updated for stats and closes but not for syncs. The updates were of course only in-core and were thrown away when the inode was uncached, so the times sometimes appeared to go backwards. Improved comments in code related to (1) (mostly by removing them). Unmacroized ITIMES(). The test in (2) bloated it even more. Don't call getmicrotime() in the function version of it when we only need the time in seconds.
36646	04-Jun-1998	dfr	Use size_t instead of u_int for sizes.
36645	04-Jun-1998	dfr	If the filesystem blocksize is less than the VM page size, use the generic getpages code. This happens for filesystems with 4k pages on the alpha since the normal alpha pagesize is 8k.
36644	04-Jun-1998	dfr	Don't cast a pointer to an int in DQHASH.
36581	02-Jun-1998	julian	Add a reference to the original softupdates paper
36580	02-Jun-1998	julian	Add a reference to the Ganger/Patt paper
36404	27-May-1998	julian	A fix to a debug test from Kirk.
36235	19-May-1998	julian	Ensure that there is enough information here, so that people can use soft updates should they desire.
36234	19-May-1998	julian	Bring up-to-date with Whistle's current version Includes some debugging code.
36232	19-May-1998	julian	Merge with Kirk's version as of Feb 20 His version 9.23 == our version 1.5 of ffs_softdep.c His version 9.5 == our version 1.4 of softdep.c
36225	19-May-1998	julian	Merge in Kirk's changes to stop softupdates from hogging all of memory.
36212	19-May-1998	julian	Change to stop a silly panic. This should be understood better. Change a buffer swizzle trick to a bcopy. It would be nice if the efficient trick could be used in the future.
36210	19-May-1998	julian	First published FreeBSD version of soft updates Feb 5.
36207	19-May-1998	julian	This commit was generated by cvs2svn to compensate for changes in r36206, which included commits to RCS files with non-trunk default branches.
36202	19-May-1998	julian	This commit was generated by cvs2svn to compensate for changes in r36201, which included commits to RCS files with non-trunk default branches.
36147	18-May-1998	julian	try stop the user from using mount -u to set the async flag on a filesystem currently using soft updates. Also needs a new copy of ffs_softdep.c to complete the fix.
36119	17-May-1998	phk	s/nanoruntime/nanouptime/g s/microruntime/microuptime/g Reviewed by: bde
35955	11-May-1998	julian	Add missing splx() Submitted by: Luoqi Chen <luoqi@chen.ml.org>
35954	11-May-1998	julian	Submitted by: abial@nask.pl Minor fix to support SLICE in MFS...
35823	07-May-1998	msmith	In the words of the submitter: --------- Make callers of namei() responsible for releasing references or locks instead of having the underlying filesystems do it. This eliminates redundancy in all terminal filesystems and makes it possible for stacked transport layers such as umapfs or nullfs to operate correctly. Quality testing was done with testvn, and lat_fs from the lmbench suite. Some NFS client testing courtesy of Patrik Kudo. vop_mknod and vop_symlink still release the returned vpp. vop_rename still releases 4 vnode arguments before it returns. These remaining cases will be corrected in the next set of patches. --------- Submitted by: Michael Hancock <michaelh@cet.co.jp>
35769	06-May-1998	msmith	As described by the submitter: Reverse the VFS_VRELE patch. Reference counting of vnodes does not need to be done per-fs. I noticed this while fixing vfs layering violations. Doing reference counting in generic code is also the preference cited by John Heidemann in recent discussions with him. The implementation of alternative vnode management per-fs is still a valid requirement for some filesystems but will be revisited sometime later, most likely using a different framework. Submitted by: Michael Hancock <michaelh@cet.co.jp>
35696	04-May-1998	dyson	Correct an error that I made where the vtruncbuf was changed back to vinvalbuf, but I incorrectly added the "V_SAVE\|V_SAVEMETA" flags. Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
35526	30-Apr-1998	dyson	Fix an error that I made with an optimization. In the case of softupdates, we need to do vtruncbuf the old way. Luoqi caught, found the bug and submitted this fix. Submitted by: Luoqi Chen <luoqi@chen.ml.org>
35323	20-Apr-1998	julian	Make the devfs SLICE option a standard type option. (hopefully it will go away eventually anyhow)
35319	19-Apr-1998	julian	Add changes and code to implement a functional DEVFS. This code will be turned on with the TWO options DEVFS and SLICE. (see LINT) Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will deliniate these changes. /dev will be automatically mounted by init (thanks phk) on bootup. See /sys/dev/slice/slice.4 for more info. All code should act the same without these options enabled. Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others This code does not support the following: bad144 handling. Persistance. (My head is still hurting from the last time we discussed this) ATAPI flopies are not handled by the SLICE code yet. When this code is running, all major numbers are arbitrary and COULD be dynamically assigned. (this is not done, for POLA only) Minor numbers for disk slices ARE arbitray and dynamically assigned.
35256	17-Apr-1998	des	Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.
35205	15-Apr-1998	bde	Fixed bitrot in the non-softdep case of ufs_dirremove(): - restored async mount support. The first entry in a block is still always written synchronously, although it probably shouldn't be in the async case. - restored use of BWRITE() instead of bowrite() for the DOWHITEOUT case, although bowrite() is probably better. Broken by: merge of softdep changes (rev.1.22). Found by: lmbench2 delete-file benchmarks.
35084	06-Apr-1998	peter	Back this out, allowing users to get a fd connected to a symlink is just too dangerous.
35083	06-Apr-1998	peter	Don't panic if a VOP_READ() gets through on a short link, Just Do It (because we can :-). This means you can open a link file (or pseudo-file in the case of short links where the data is stored in the inode rather than disk blocks) and read the contents. However, trap any writes from the user as it's difficult to do the right thing in all cases. A link may be short and the user may be trying to extend it beyond the limit and so on. Although.. being able to re-target a symlink without deleting it first might have been nice. This stuff is a bit perverse since symlink() and readlink() calls can end up actually being implemented as read/write vnode ops. Reviewed by: phk
35029	04-Apr-1998	phk	Time changes mark 2: * Figure out UTC relative to boottime. Four new functions provide time relative to boottime. * move "runtime" into struct proc. This helps fix the calcru() problem in SMP. * kill mono_time. * add timespec{add\|sub\|cmp} macros to time.h. (XXX: These may change!) * nanosleep, select & poll takes long sleeps one day at a time Reviewed by: bde Tested by: ache and others
34961	30-Mar-1998	phk	Eradicate the variable "time" from the kernel, using various measures. "time" wasn't a atomic variable, so splfoo() protection were needed around any access to it, unless you just wanted the seconds part. Most uses of time.tv_sec now uses the new variable time_second instead. gettime() changed to getmicrotime(0. Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it). A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random. Add a new nfs_curusec() function. Mark a couple of bogosities involving the now disappeard time variable. Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passwd &time as args. Change profiling in ncr.c to use ticks instead of time. Resolution is the same. Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences. Reviewed by: bde
34924	28-Mar-1998	bde	Moved some #includes from <sys/param.h> nearer to where they are actually used.
34913	27-Mar-1998	peter	Enable the use of soft updates on the root filesystem. Previously, the softdep mode could only be activated on the initial mount of a filesystem and then only if it was a read-write mount. A 'mount -r' (as done in the rootfs mount) followed by a 'mount -u' to convert to read-write didn't start softdep mode.
34901	26-Mar-1998	phk	Add two new functions, get{micro\|nano}time. They are atomic, but return in essence what is in the "time" variable. gettime() is now a macro front for getmicrotime(). Various patches to use the two new functions instead of the various hacks used in their absence. Some puntuation and grammer patches from Bruce. A couple of XXX comments.
34826	23-Mar-1998	bde	Forward declare even more structs to restore some self-sufficiency. Didn't fix new dependence on <ufs/ufs/inode.h> and its prerequisites.
34734	21-Mar-1998	dyson	Softdep_sync_metadata appears to expect that it is called at splbio, so make it so...
34696	19-Mar-1998	dyson	Fix vfs_bio_awrite usage, and correct vtruncbuf usage.
34611	16-Mar-1998	dyson	Some VM improvements, including elimination of alot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!! pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.) pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances. vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.) vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust. vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true. vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.) vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities. vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.) vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliabiliy and performance issue.) 10) Modify FFS to use vtruncbuf. vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.) vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, support page invalidation change the object generation count so that we handle generation counts a little more robustly. vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.) vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
34441	09-Mar-1998	dyson	Correct a problem with the ffs_getpages routine that manifest's itself during the tail command. The amount to read is incorrectly calculated. Submitted by: Tor Egge
34266	08-Mar-1998	julian	Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree
34248	08-Mar-1998	julian	Submitted by: kirk McKusick Stub file for soft updates.
34206	07-Mar-1998	dyson	This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifest themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistancy, and as a side-effect broke the vfs.ioopt code. This code might have been committed seperately, but almost everything is interrelated. 1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitious buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups assocated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistancy problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify it's status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and it's descendents to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
34184	07-Mar-1998	bde	Fixed missing simple_lock() in ffs_mountfs().
33964	01-Mar-1998	msmith	The intent is to get rid of WILLRELE in vnode_if.src by making a complement to all ops that return a vpp, VFS_VRELE. This is initially only for file systems that implement the following ops that do a WILLRELE: vop_create, vop_whiteout, vop_mknod, vop_remove, vop_link, vop_rename, vop_mkdir, vop_rmdir, vop_symlink This is initial DNA that doesn't do anything yet. VFS_VRELE is implemented but not called. A default vfs_vrele was created for fs implementations that use the standard vnode management routines. VFS_VRELE implementations were made for the following file systems: Standard (vfs_vrele) ffs mfs nfs msdosfs devfs ext2fs Custom union umapfs Just EOPNOTSUPP fdesc procfs kernfs portal cd9660 These implementations may change as VOP changes are implemented. In the next phase, in the vop implementations calls to vrele and the vrele part of vput will be moved to the top layer vfs_vnops and made visible to all layers. vput will be replaced by unlock in these cases. Unlocking will still be done in the per fs layer but the refcount decrement will be triggered at the top because it doesn't hurt to hold a vnode reference a little longer. This will have minimal impact on the structure of the existing code. This will only be done for vnode arguments that are released by the various fs vop implementations. Wider use of VFS_VRELE will likely require restructuring of the code. Reviewed by: phk, dyson, terry et. al. Submitted by: Michael Hancock <michaelh@cet.co.jp>
33847	26-Feb-1998	msmith	In the author's words: These diffs implement the first stage of a VOP_{GET\|PUT}PAGES pushdown for local media FS's. See ffs_putpages in /sys/ufs/ufs/ufs_readwrite.c for implementation details for generic _{get\|put}pages for local media FS's. Support is trivial to add for any FS that formerly relied on the default behaviour of the vnode_pager in in EOPNOTSUPP cases (just copy the ffs_getpages() code for the FS in question's _{get\|put}pages). Obviously, it would be better if each local media FS implemented a more optimal method, instead of calling an exported interface from the /sys/vm/vnode_pager.c, but this is a necessary first step in getting the FS's to a point where they can be supplied with better implementations on a case-by-case basis. Obviously, the cd9660_putpages() can be rather trivial (since it is a read-only FS type 8-)). A slight (temporary) modification is made to print a diagnostic message in the case where the underlying filesystem attempts to engage in the previous behaviour. Failure is likely to be ungraceful. Submitted by: terry@freebsd.org (Terry Lambert)
33820	25-Feb-1998	bde	Fixed missing permissions checking for mounting by non-root. There is now less need for the vfs.usermount sysctl. msdosfs already has this change, modulo a missing LK_RETRY, via NetBSD. At least ext2fs is missing this and many other changes from Lite2. Obtained from: Lite2
33678	20-Feb-1998	bde	Don't depend on "implicit int".
33443	16-Feb-1998	msmith	Fix a panic resulting from executing off an MFS image. This corrects the recently observed problem with the install image. Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no>
33289	13-Feb-1998	bde	Removed unnecessary dependencies on KERNEL and DIAGNOSTIC. This was more useful when opt_diagnostic.h had to be included.
33181	09-Feb-1998	eivind	Staticize.
33134	06-Feb-1998	eivind	Back out DIAGNOSTIC changes.
33109	05-Feb-1998	dyson	1) Start using a cleaner and more consistant page allocator instead of the various ad-hoc schemes. 2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup. 3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some processor errata, and to minimize redundant processor updating of page tables. 4) Modify pmap_protect so that it can only remove permissions (as it originally supported.) The additional capability is not needed. 5) Streamline read-only to read-write page mappings. 6) For pmap_copy_page, don't enable write mapping for source page. 7) Correct and clean-up pmap_incore. 8) Cluster initial kern_exec pagin. 9) Removal of some minor lint from kern_malloc. 10) Correct some ioopt code. 11) Remove some dead code from the MI swapout routine. 12) Correct vm_object_deallocate (to remove backing_object ref.) 13) Fix dead object handling, that had problems under heavy memory load. 14) Add minor vm_page_lookup improvements. 15) Some pages are not in objects, and make sure that the vm_page.c can properly support such pages. 16) Add some more page deficit handling. 17) Some minor code readability improvements.
33108	04-Feb-1998	eivind	Turn DIAGNOSTIC into a new-style option.
33054	03-Feb-1998	bde	Forward declare some structs so that this file is more self-sufficient.
32976	01-Feb-1998	dyson	Back out recent laptop sync changes. They had significant errors.
32951	01-Feb-1998	dyson	Support more intelligent sync operations for MNT_NOATIME. PR: kern/5577 Submitted by: Craig Leres <leres@ee.lbl.gov>
32944	31-Jan-1998	julian	Serves me right for not puting SUIDDIR in LINT. it got bitrot. This should stop complaints about it not working for people.
32889	30-Jan-1998	phk	Retire LFS. If you want to play with it, you can find the final version of the code in the repository the tag LFS_RETIREMENT. If somebody makes LFS work again, adding it back is certainly desireable, but as it is now nobody seems to care much about it, and it has suffered considerable bitrot since its somewhat haphazard integration. R.I.P
32726	24-Jan-1998	eivind	Make all file-system (MFS, FFS, NFS, LFS, DEVFS) related option new-style. This introduce an xxxFS_BOOT for each of the rootable filesystems. (Presently not required, but encouraged to allow a smooth move of option *FS to opt_dontuse.h later.) LFS is temporarily disabled, and will be re-enabled tomorrow.
32724	24-Jan-1998	dyson	Add better support for larger I/O clusters, including larger physical I/O. The support is not mature yet, and some of the underlying implementation needs help. However, support does exist for IDE devices now.
32702	22-Jan-1998	dyson	VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneded collpase operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
32585	17-Jan-1998	dyson	Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management. The system should be much more stable, but all subtile bugs aren't fixed yet.
32286	06-Jan-1998	dyson	Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitiously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does. When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex. When vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore. A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes. Please be careful with your system while running this code, but I would greatly appreciate feedback as soon a reasonably possible.
32161	01-Jan-1998	bde	Removed unused #includes again. They thrashed when mfs_reclaim thrashed to ufs_reclaim and back.
32072	29-Dec-1997	dyson	Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem with the object cache removal.
32071	29-Dec-1997	dyson	Lots of improvements, including restructring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include: 1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync. Be gentle, and please give me feedback asap.
32011	27-Dec-1997	bde	Unspammed nested include of <vm/vm_zone.h>.
31920	21-Dec-1997	dyson	I added vfs_ioopt prematurely, disabled.
31853	19-Dec-1997	dyson	Some performance improvements, and code cleanups (including changing our expensive OFF_TO_IDX to btoc whenever possible.)
31788	16-Dec-1997	eivind	Make LINT compile again after wollman introduced poll() here. Overlooked by: wollman
31749	15-Dec-1997	eivind	Convert SUIDDIR fully to a new-style option. Forgotten by: julian
31727	15-Dec-1997	wollman	Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL if one of the new poll types is requested; hopefully this will not break any existing code. (This is done so that programs have a dependable way of determining whether a filesystem supports the extended poll types or not.) The new poll types added are: POLLWRITE - file contents may have been modified POLLNLINK - file was linked, unlinked, or renamed POLLATTRIB - file's attributes may have been changed POLLEXTEND - file was extended Note that the internal operation of poll() means that it is impossible for two processes to reliably poll for the same event (this could be fixed but may not be worth it), so it is not possible to rewrite `tail -f' to use poll at this time.
31699	13-Dec-1997	bde	Restored ufs_pathconf() from rev.1.61. vop_stdpathconf() is too general to be of much use. Using it here broke the _PC_NAME_MAX, _PC_NO_TRUNC and _PC_PATH_MAX cases, and weakened the _PC_MAX_CANON, _PC_MAX_INPUT and _PC_VDISABLE cases.
31683	12-Dec-1997	peter	Fix(?) some style consistancy breakage and do some other nit-picking on the SUIDDIR changes.
31561	05-Dec-1997	bde	Don't include <sys/lock.h> in headers when only `struct simplelock' is required. Fixed everything that depended on the pollution.
31557	05-Dec-1997	jkh	Needs to include <sys/lock.h> if we're using struct lock.
31493	02-Dec-1997	phk	In all such uses of struct buf: 's/b_un.b_addr/b_data/g'
31486	02-Dec-1997	bde	`nextgennumber' can go away now that is no longer (ab)used by foreign fs's.
31485	02-Dec-1997	bde	Use the same algorithm as ffs for generation numbers.
31484	02-Dec-1997	bde	Fix a small style bug in the generation number change (rev.1.33) before copying the change to other fs's.
31394	24-Nov-1997	bde	Fixed overflow in ufs_getblns(). For ufs on systems with 32-bit ints, triple indirect blocks only worked for block sizes of 4K, since MNINDIR(ump)3 overflows for larger block sizes (e.g., (8192/4)3 = 2**33 > INT_MAX). This fix is not the obvious one of changing some types to 64 bits. It rearranges the code to avoid some unnecessary 64-bit calculations. Reviewed by: Kirk McKusick <mckusick@McKusick.COM>
31352	22-Nov-1997	bde	Staticized.
31351	22-Nov-1997	bde	Unremoved prtrealloc and the declaration of ffs_clusteralloc(). These are used in the `#ifdef notyet' case :-). This case is used except in the `#if !defined (not_yes)' case :-\|. This has something to do with the `#ifdef notyet_block_reallocation_enabled' case in vfs_cluster.c :-(.
31312	20-Nov-1997	bde	Fixed marking of access time for special files and fifos (don't do it if the file system is mounted noatime). Not fixed: the access time is marked at the start of a read() and not marked on successful completion. I think this should be handled at the vfs level. Print a better panic message for missing vops. Don't use printf() before panic(), since the printf()ed part isn't shown by gdb. This actually loses a little with the current gdb, since gdb just prints the fmt arg to panic, so %'s aren't expanded. gdb should fetch the full message from the message buffer if possible. Fixed default vop function for vop_getpages_desc. It needs to just return EOPNOTSUPP so that the vnode pager can get the pages in using a general method. Panicing broke exec'ing of files on ext2fs file systems. ffs works because it doesn't use the default. Fixed nearby style bugs.
31274	18-Nov-1997	bde	Removed an unused #include in the `#ifdef KERNEL' case. Fixed a comment to match the code. The code is still wrong (ffs_checkoverlap() should be staticized and called from a ddb command).
31269	18-Nov-1997	phk	unifdef -UEXT2FS
31147	13-Nov-1997	julian	oops, fix left out semicolon in code I patched by hand.
31144	13-Nov-1997	julian	Reviewed by: hackers@freebsd.org in general Obtained from: Whistle Communications tree Add an option to the way UFS works dependent on the SUID bit of directories This changes makes things a whole lot simpler on systems running as fileservers for PCs and MACS. to enable the new code you must 1/ enable option SUIDDIR on the kernel. 2/ mount the filesystem with option suiddir. hopefully this makes it difficult enough for people to do this accidentally. see the new chmod(2) man page for detailed info.
31132	12-Nov-1997	julian	Reviewed by: various. Ever since I first say the way the mount flags were used I've hated the fact that modes, and events, internal and exported, and short-term and long term flags are all thrown together. Finally it's annoyed me enough.. This patch to the entire FreeBSD tree adds a second mount flag word to the mount struct. it is not exported to userspace. I have moved some of the non exported flags over to this word. this means that we now have 8 free bits in the mount flags. There are another two that might well move over, but which I'm not sure about. The only user visible change would have been in pstat -v, except that davidg has disabled it anyhow. I'd still like to move the state flags and the 'command' flags apart from each other.. e.g. MNT_FORCE really doesn't have the same semantics as MNT_RDONLY, but that's left for another day.
31016	07-Nov-1997	phk	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused
30994	06-Nov-1997	phk	Move the "retval" (3rd) parameter from all syscall functions and put it in struct proc instead. This fixes a boatload of compiler warning, and removes a lot of cruft from the sources. I have not removed the /ARGSUSED/, they will require some looking at. libkvm, ps and other userland struct proc frobbing programs will need recompiled.
30890	01-Nov-1997	tegge	Move declaration of M_MFSNODE from mfs_vnops.c to mfsnode.h.
30889	01-Nov-1997	tegge	Bring back mfs_reclaim(), which is used to reclaim the master vnode in MFS.
30780	27-Oct-1997	bde	Removed unused #includes. The need for most of them went away with recent changes (docluster* and vfs improvements).
30779	27-Oct-1997	bde	Forward declare precisely the structs that are actually used in this header.
30743	26-Oct-1997	phk	VFS interior redecoration. Rename vn_default_error to vop_defaultop all over the place. Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite. Use vop_null instead of nullop. Move vop_nopoll from vfs_subr.c to vfs_default.c Move vop_sharedlock from vfs_subr.c to vfs_default.c Move vop_nolock from vfs_subr.c to vfs_default.c Move vop_nounlock from vfs_subr.c to vfs_default.c Move vop_noislocked from vfs_subr.c to vfs_default.c Use vop_ebadf instead of *_ebadf. Add vop_defaultop for getpages on master vnode in MFS.
30608	20-Oct-1997	phk	I belive this fixes MFS after I broke it.
30563	19-Oct-1997	dyson	This might fix the mfs_badop problem left over with the cool VFS fixes. PHK should check this.
30513	17-Oct-1997	phk	Make a set of VOP standard lock, unlock & islocked VOP operators, which depend on the lock being located at vp->v_data. Saves 3x3 identical vop procs, more as the other filesystems becomes lock aware.
30496	16-Oct-1997	phk	VFS clean up "hekto commit" 1. Add defaults for more VOPs VOP_LOCK vop_nolock VOP_ISLOCKED vop_noislocked VOP_UNLOCK vop_nounlock and remove direct reference in filesystems. 2. Rename the nfsv2 vnop tables to improve sorting order.
30492	16-Oct-1997	phk	Another VFS cleanup "kilo commit" 1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS} intereface function, and now lives in the ufsmount structure. 2. Remove VOP_SEEK, it was unused. 3. Add mode default vops: VOP_ADVLOCK vop_einval VOP_CLOSE vop_null VOP_FSYNC vop_null VOP_IOCTL vop_enotty VOP_MMAP vop_einval VOP_OPEN vop_null VOP_PATHCONF vop_einval VOP_READLINK vop_einval VOP_REALLOCBLKS vop_eopnotsupp And remove identical functionality from filesystems 4. Add vop_stdpathconf, which returns the canonical stuff. Use it in the filesystems. (XXX: It's probably wrong that specfs and fifofs sets this vop, shouldn't it come from the "host" filesystem, for instance ufs or cd9660 ?) 5. Try to make system wide VOP functions have vop_* names. 6. Initialize the um_* vectors in LFS. (Recompile your LKMS!!!)
30476	16-Oct-1997	phk	Staticize the ufs vnops member functions.
30475	16-Oct-1997	phk	Remove an overlapping variable I created in last round. Don't do pointer subtraction on void * Use VOP_STRATEGY instead of homegrown stuff. Add an XXX warning for LFS freaks to ponder.
30474	16-Oct-1997	phk	VFS mega cleanup commit (x/N) 1. Add new file "sys/kern/vfs_default.c" where default actions for VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE, POLL, REVOKE and STRATEGY. Various stuff spread over the entire tree belongs here. 2. Change VOP_BLKATOFF to a normal function in cd9660. 3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These are private interface functions between UFS and the underlying storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now live in struct ufsmount instead. 4. Remove a kludge of VOP_ functions in all filesystems, that did nothing but obscure the simplicity and break the expandability. If a filesystem doesn't implement VOP_FOO, it shouldn't have an entry for it in its vnops table. The system will try to DTRT if it is not implemented. There are still some cruft left, but the bulk of it is done. 5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
30469	16-Oct-1997	julian	Two more places where root filesystems were mounted, put them at the head of the mount list in case there is already DEVFS present.
30439	15-Oct-1997	phk	vnops megacommit 1. Use the default function to access all the specfs operations. 2. Use the default function to access all the fifofs operations. 3. Use the default function to access all the ufs operations. 4. Fix VCALL usage in vfs_cache.c 5. Use VOCALL to access specfs functions in devfs_vnops.c 6. Staticize most of the spec and fifofs vnops functions. 7. Make UFS panic if it lacks bits of the underlying storage handling.
30434	15-Oct-1997	phk	Hmm, realign the vnops into two columns.
30431	15-Oct-1997	phk	Stylistic overhaul of vnops tables. 1. Remove comment stating the blatantly obvious. 2. Align in two columns. 3. Sort all but the default element alphabetically. 4. Remove XXX comments pointing out entries not needed.
30428	15-Oct-1997	bde	IN_HASHED goes in the in-core flags ip->i_flag, not in the on-disk flags ip->i_flags. Rev.1.18 completely broke ufs. My root directory went away about 10 seconds after booting. I think file system damage was null, since IN_HASHED = 0x80 is not used in the disk flags (it would probably be UF_SOMETHING if it were used).
30419	14-Oct-1997	phk	Reset the flag right away, could catch a bogon someday.
30418	14-Oct-1997	phk	I think my previous change may have opened a race conditio. This patch does the same thing, with no change in semantics.
30402	14-Oct-1997	phk	ufs_ihashrem() should not be called from the UFS layer, but from the lower layer (LFS/FFS/?) like the rest of the ihash functions. Otherwise it is impossible to make a lower layer that doesn't use the ihash facility.
30354	12-Oct-1997	phk	Last major round (Unless Bruce thinks of somthing :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde
30309	11-Oct-1997	phk	Distribute and statizice a lot of the malloc M_* types. Substantial input from: bde
30285	10-Oct-1997	phk	Make ufs_reclaim free the underlying inode.
30284	10-Oct-1997	phk	Use generic ufs_reclaim().
30283	10-Oct-1997	phk	Add type arg to ffs_mountfs and avoid examining v_tag to find out if MFS is getting a free ride. Use generic ufs_reclaim().
29888	27-Sep-1997	kato	Clustered read and write are switched at mount-option level. 1. Clustered I/O is switched by the MNT_NOCLUSTERR and MNT_NOCLUSTERW bits of the mnt_flag. The sysctl variables, vfs.foo.doclusterread and vfs.foo.doclusterwrite are deleted. Only mount option can control clustered I/O from userland. 2. When foofs_mount mounts block device, foofs_mount checks D_CLUSTERR and D_CLUSTERW bits of the d_flags member in the block device switch table. If D_NOCLUSTERR / D_NOCLUSTERW are set, MNT_NOCLUSTERR / MNT_NOCLUSTERW bits will be set. In this case, MNT_NOCLUSTERR and MNT_NOCLUSTERW cannot be cleared from userland. 3. Vnode driver disables both clustered read and write. 4. Union filesystem disables clutered write. Reviewed by: bde
29725	22-Sep-1997	joerg	Make MFS a supported option, finally.
29685	21-Sep-1997	gibbs	Convert tqdisksort to bufqdisksort. Honor the B_ORDERED buffer flag so that meta-data writes go out to the device in the right order.
29684	21-Sep-1997	gibbs	Update for new buffer queue data structure.
29653	21-Sep-1997	dyson	Change the M_NAMEI allocations to use the zone allocator. This change plus the previous changes to use the zone allocator decrease the useage of malloc by half. The Zone allocator will be upgradeable to be able to use per CPU-pools, and has more intelligent usage of SPLs. Additionally, it has reasonable stats gathering capabilities, while making most calls inline.
29609	19-Sep-1997	phk	[Regarding the previous patch] This is completely wrong. 1. ffs_alloc() actually allowed writing one block less one frag (normally 7 frags or 7/8 blocks) beyond the limit. 2. freebufspace() gives the free space in frags, but `size' is in bytes, so the change results in approximately `size' fragments too many being reserved. 3. ffs_realloccg() has the same bug but wasn't changed. PR: 3398 Submitted by: bde Eyeballed by: phk
29581	18-Sep-1997	phk	Ffs_alloc allow users to write one block beyond the limit. PR: 3398 Reviewed by: phk Submitted by: Wolfram Schneider <wosch@apfel.de>
29362	14-Sep-1997	peter	Convert select -> poll. Delete 'always succeed' select/poll handlers, replaced with generic call. Flag missing vnode op table entries.
29287	10-Sep-1997	phk	Update the comment and remove checks now done centrally.
29208	07-Sep-1997	bde	Removed yet more vestiges of config-time swap configuration and/or cleaned up nearby cruft.
29041	02-Sep-1997	bde	Removed unused #includes.
28954	31-Aug-1997	phk	Change the 0xdeadb hack to a flag called VDOOMED. Introduce VFREE which indicates that vnode is on freelist. Rename vholdrele() to vdrop(). Create vfree() and vbusy() to add/delete vnode from freelist. Add vfree()/vbusy() to keep (v_holdcnt != 0 \|\| v_usecount != 0) vnodes off the freelist. Generalize vhold()/v_holdcnt to mean "do not recycle". Fix reassignbuf()s lack of use of vhold(). Use vhold() instead of checking v_cache_src list. Remove vtouch(), the vnodes are always vget'ed soon enough after for it to have any measuable effect. Add sysctl debug.freevnodes to keep track of things. Move cache_purge() up in getnewvnodes to avoid race. Decrement v_usecount after VOP_INACTIVE(), put a vhold() on it during VOP_INACTIVE() Unmacroize vhold()/vdrop() Print out VDOOMED and VFREE flags (XXX: should use %b) Reviewed by: dyson
28787	26-Aug-1997	phk	Uncut&paste cache_lookup(). This unifies several times in theory indentical 50 lines of code. The filesystems have a new method: vop_cachedlookup, which is the meat of the lookup, and use vfs_cache_lookup() for their vop_lookup method. vfs_cache_lookup() will check the namecache and pass on to the vop_cachedlookup method in case of a miss. It's still the task of the individual filesystems to populate the namecache with cache_enter(). Filesystems that do not use the namecache will just provide the vop_lookup method as usual.
28774	26-Aug-1997	dyson	Back out some incorrect changes that was worse than the original bug.
28701	25-Aug-1997	kato	Renamed doclusterread/write to unique names (ffs_doclusterread/write), and staticize them. Move the #include of <sys/sysctl.h> to the top of the file. Pointed out by: Bruce Evans <bde@zeta.org.au>
28598	22-Aug-1997	dyson	Fix the "remove optimization" by removing it. Sorry for the trouble.
28558	22-Aug-1997	dyson	This is a trial improvement for the vnode reference count while on the vnode free list problem. Also, the vnode age flag is no longer used by the vnode pager. (It is actually incorrect to use then.) Constructive feedback welcome -- just be kind.
28466	21-Aug-1997	dyson	Performance improvment to minimize delayed write output of files that have been deleted. Submitted by: Peter M. Chen <pmchen@eecs.umich.edu>
28270	16-Aug-1997	wollman	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
27890	04-Aug-1997	phk	We got a couple of "map mismatch" panics from the following code. According to the crash dump, bpref is set to 445 and cgp->cg_nclusterblks is 444. Hence in the for loop, the test fails immediately but the following failure check (got == cgp->cg_nclusterblks) doesn't trigger because got > cgp->cg_nclusterblks. This wreaks havoc in the code after that. Fix: Move one source bit to the left :-) Noticed by: Mike Hibler <mike@fast.cs.utah.edu> Submitted by: Kirk McKusick <mckusick@McKusick.COM>
27845	02-Aug-1997	bde	Removed unused #includes.
27378	13-Jul-1997	bde	Always mark st_ctime for update upon successful completion of chown(). Previously, it wasn't marked for null chown()'s. We permit null chown()s as a special case of "appropriate privilege" - everyone has enough priviilege to not change ids (this is a better argument than the one I gave for rev.1.13, that null changes aren't really changes). However, POSIX.1 requires the update independently of whether anything has changed. Clear both the setuid and the setgid bits upon successful completion of non-null chown()s by non-root. Previously, the setuid bit was only changed for non-null changes of the uid, etc. POSIX.1 requires clearing both unless the call was made by a process with "appropriate privilege", in which case altering the bits is implementation-defined. We define appropriate privilege as `process is root, or the change is null', and the implementation-defined behaviour as not altering the bits. There is no interpretation that permits clearing only one of the bits. Reviewed by: jdp
27377	13-Jul-1997	bde	Use the correct size for a sector in the search for a label in readdisklabel(). Sectors may be larger than DEV_BSIZE.
27376	13-Jul-1997	bde	Removed semicolon from the end of a #define.
27375	13-Jul-1997	bde	Fixed comment about i_spare.
26664	15-Jun-1997	dyson	Fix a problem with the VN device. Specifically, the VN device can cause a problem of spiraling death due to buffer resource limitations. The vfs_bio code in general had little ability to handle buffer resource management, and now it does. Also, there are a lot more knobs for tuning the vfs_bio code now. The knobs came free because of the need that there always be some immediately available buffers (non-delayed or locked) for use. Note that the buffer cache code is much less likely to get bogged down with lots of delayed writes, even more so than before.
26360	02-Jun-1997	julian	Submitted by: Whistle Communications (archie Cobbs) These changes add the ability to specify that a UFS file/directory cannot be unlinked. This is basically a scaled back version of the IMMUTABLE flag. The reason is to allow an administrator to create a directory hierarchy that a group of users can arbitrarily add/delete files from, but that the hierarchy itself is safe from removal by them. If the NOUNLINK definition is set to 0 then this results in no change to what happens normally. (and results in identical binary (in the kernel)). It can be proven that if this bit is never set by the admin, no new behaviour is introduced.. Several "good idea" comments from reviewers plus one grumble about creeping featurism. This code is in production in 2.2 based systems
26112	25-May-1997	peter	Fix warnings (from LINT). Missing static prototype, missing vm includes for vnode_pager_setsize().
26001	22-May-1997	phk	Shrink struct inode by 20 bytes, so that malloc wastes less space. Pointed out by: bde
25877	17-May-1997	phk	Remove redundant check for vp == dvp (done in VFS before calling).
25244	28-Apr-1997	jkh	Mount MFS read/write as in days of yore.
24775	10-Apr-1997	bde	Use smalllblktosize() instead of multiplying small block numbers by fs->fs_bsize. The macro is usually faster and makes it clearer that the multiplication can't overflow.
24477	01-Apr-1997	bde	Removed nested include of <ufs/ufs/dir.h>. Use the pre-Lite2 hack of defining doff_t both here and in <ufs/ufs/dir.h> so that this file is independent of <ufs/ufs/dir.h>. It still has old prerequisites <sys/param.h> and <ufs/ufs/quota.h>, and a new Lite2 prerequisite of <sys/lock.h>, sigh. This might fix lsof, which was broken by namespace pollution giving conflicting definitions of DIRBLKSIZ.
24438	31-Mar-1997	peter	Treat symlinks as first class citizens with their own uid/gid rather than as shadows of their containing directory. This should solve the problem of users not being able to delete their symlinks from /tmp once and for all. Symlinks do not have modes though, they are accessable to everything that can read the directory (as before). They are made to show this fact at lstat time (they appear as mode 0777 always, since that's how the the lookup routines in the kernel treat them). More commits will follow, eg: add a real lchown() syscall and man pages.
24203	24-Mar-1997	bde	Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include it when it is not used. In most cases, the reasons for including it went away when the special ioctl headers became self-sufficient.
24171	24-Mar-1997	bde	Fixed corrupted newline and corrupted tab in previous commit.
24149	23-Mar-1997	guido	Add generation number randomization. Newly created filesystems wil now automatically have random generation numbers. The kenel way of handling those also changed. Further it is advised to run fsirand on all your nfs exported filesystems. the code is mostly copied from OpenBSD, with the randomization chanegd to use /dev/urandom Reviewed by: Garrett Obtained from: OpenBSD
24131	23-Mar-1997	bde	Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.
24128	23-Mar-1997	bde	Merged the rest of lfs from Lite2. It compiles (uncleanly) but is as unlikely to work as before.
24103	22-Mar-1997	bde	Merged enough of lfs from Lite2 for mkdep of LINT to work again.
24102	22-Mar-1997	bde	Removed `volatile' from declaration of `time', and removed the resulting null casts. `time' is nonvolatile for accesses within a region locked by splclock()/splx(). Accesses outside such a region are invalid, and splx() must have the side effect of potentially changing all global variables (since there are hundreds of sort of volatile variables like `time'), so declaring `time' as volatile didn't have any real benefits.
24101	22-Mar-1997	bde	Fixed some invalid (non-atomic) accesses to `time', mostly ones of the form `tv = time'. Use a new function gettime(). The current version just forces atomicicity without fixing precision or efficiency bugs. Simplified some related valid accesses by using the central function.
24098	22-Mar-1997	bde	Backed out rev.1.27, which broke unmounting of mfs and caused panics on shutdown. Should not have been in 2.2 (the buggy last minute change, that is).
23998	18-Mar-1997	peter	MAXDIRSIZE is (or would be) used in fsck. It's a sanity check.
23997	18-Mar-1997	peter	Restore the lost MNT_LOCAL flag twiddle. Lite2 has a different mechanism of setting it (compiled into vfs_conf.c), but we have a dynamic system in place. This could probably be better done via a runtime configure flag in the VFS_SET() VFS declaration, perhaps VFCF_LOCAL, and have the VFS code propagate this down into MNT_LOCAL at mount time. The other FS's would need to be updated, havinf UFS and MSDOSFS filesystems without MNT_LOCAL breaks a few things.. the man page rebuild scans for local filesystems and currently fails, I suspect that other tools like find and tar with their "local filesystem only" modes might be affected.
23908	15-Mar-1997	sos	Fix support for != 512 byte sector devices. Restores the use of SBLOCK instead of the BSOFF/sectorsize calculation. Using SBLOCK is bogus however in that it uses DEV_BSIZE instead of the actual sector size, but that is taken care of in other places. Changing the SBLOCK would be better, but it affects the system in other places, and doing it this way makes it possible to use filesystems that was made before the lite2 merge.
23562	09-Mar-1997	mpp	Update a number of routines to reflect the actual name of the routine that caused the panic.
23560	09-Mar-1997	mpp	Update a number of panic messages to reflect the actual name of the routine that caused the panic.
23388	05-Mar-1997	msmith	Supply the mount point given to mfs_mount when getting a vnode for the mount. This may have been a contributor to the 'null v_mount in fsync()' problem This is another, perhaps slightly less urgent, 2.2 last-minute candidate. Reviewed by: sef
23383	04-Mar-1997	bde	Fixed connection of vfs.ffs node to the sysctl tree.
23347	03-Mar-1997	bde	Removed unused flag IN_RECURSE and unused struct member i_lockcount.
23346	03-Mar-1997	bde	Removed useless setting of IN_RECURSE. The (anti) locking for this needs to be done in a different way, if at all.
22975	22-Feb-1997	peter	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
22881	18-Feb-1997	bde	This now uses queue macros. Include <sys/queue.h> if !KERNEL to preserve the documented interface.
22619	13-Feb-1997	bde	Removed FIFO ifdef again (see rev.1.5).
22579	12-Feb-1997	mpp	Add function prototypes for most of the new Lite2 functions. Also made a few of the miscfs routines static to be consistent. Some modules simply required some additional #includes to remove -Wall warnings.
22544	10-Feb-1997	mpp	Correct the new Lite2 #ifdef DIAGNOSTIC ffs_checkblk routine to not return without setting a return value when it can't read a block error or detects a bad cylinder group, since the caller is expecting a return value. It will now panic at this point, since the thing to do in this case would be to return a "bad block" status to the caller, and the caller will panic anyways when that happens. Also updated to panic strings in this routine to read "ffs_checkblk: ..." instead of "checkblk: ...".
22540	10-Feb-1997	mpp	Make this compile after the Lite2 merge. A non-existent variable was being used.
22539	10-Feb-1997	mpp	Make ffs_subr.c compile when DIAGNOSTIC is defined. It looks like this was broken before the Lite2 merge :-(. VOP_BMAP was being called with the wrong number of arguments.
22521	10-Feb-1997	dyson	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
21673	14-Jan-1997	jkh	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
21002	29-Dec-1996	dyson	This commit is the embodiment of some VFS read clustering improvements. Firstly, now our read-ahead clustering is on a file descriptor basis and not on a per-vnode basis. This will allow multiple processes reading the same file to take advantage of read-ahead clustering. Secondly, there previously was a problem with large reads still using the ramp-up algorithm. Of course, that was bogus, and now we read the entire "chunk" off of the disk in one operation. The read-ahead clustering algorithm should use less CPU than the previous also (I hope :-)). NOTE: THAT LKMS MUST BE REBUILT!!!
20311	11-Dec-1996	dyson	Significant performance improvement for mmap'ed files. This commit makes MADV_SEQUENTIAL much more effective. I suggest that we start using MADV_SEQUENTIAL on system utilities that mmap their input files, and the I/O is predominantely sequential. Below is a test with 'cmp' on two relatively large binary files, where the files are so large that the caching is ineffective: + ls -l t1.xxx t2.xxx -rw-r--r-- 1 root wheel 65598384 Dec 10 12:13 t1.xxx -rw-r--r-- 1 root wheel 65598384 Dec 10 12:14 t2.xxx + time cmp t1.xxx t2.xxx 3.78user 0.70system 1:33.43elapsed 4%CPU + time cmpmadv t1.xxx t2.xxx 4.21user 1.05system 0:30.93elapsed 17%CPU This change is as a result of an observation made by BDE.
20070	01-Dec-1996	bde	Removed all references to b_cylinder (aka b_cylin). It was evil and hasn't been used for a year or two since disksort() started sorting on b_pblkno.
20061	01-Dec-1996	sos	This update adds the support for != 512 byte sector SCSI devices to the sd & od drivers. There is also slight changes to fdisk & newfs in order to comply with different sectorsizes. Currently sectors of size 512, 1024 & 2048 are supported, the only restriction beeing in fdisk, which hunts for the sectorsize of the device. This is based on patches to od.c and the other system files by John Gumb & Barry Scott, minor changes and the sd.c patches by me. There also exist some patches for the msdos filesys code, but I havn't been able to test those (yet). John Gumb (john@talisker.demon.co.uk) Barry Scott (barry@scottb.demon.co.uk)
19700	13-Nov-1996	julian	Submitted by: Archie and me. We encountered an interesting situation where the superblock for a file system got written to disk with the "fs_fmod" flag set to one. It appears that this flag is normally supposed to be cleared during ffs_sync(), but we experienced a crash, or some other weird occurrence that left it on the disk set to 1. Later this partition was mounted read-only... and the fs_fmod field was never cleared, causing ffs_sync() to panic "rofs mod" when trying to unmount that filesystem (ffs_vfsops.c: line 790). fix: set this bit to 0 when you load the superblock from disk. (see more complete mail on this to hackers)
19424	05-Nov-1996	dg	Eliminate an unnecessary synchronous write (and an 8K bcopy+bzero) when truncating/deleting large files. Reviewed by: mckusick, dyson Submitted by: Kirk McKusick <mckusick@mckusick.com>, modified for FreeBSD by me.
19403	04-Nov-1996	hsu	struct mfsnode bloated in size by 12 bytes, so reduce spare padding by 3 longs. We now only have 4 spare bytes before hitting the dreaded 32 byte threshold.
19388	04-Nov-1996	bde	Fixed some races and misleading comments in ufs_rename(). 1. When a directory is renamed to an existing (empty) directory, it is possible for the target vnode to become the source vnode underneath you (because another process may complete the same rename). It was assumed that this can't happen, and the bogus errno EINVAL was returned. This was fairly harmless. Fix: return ENOENT instead, as if the source directory was renamed a little earlier. 2. The same metamorphosis is possible for non-directories. It was assumed that this can't happen, and the code for handling "just removing a link name" happened to be used. This would have worked except for fatal bugs in the link name removal - the link name was assumed to still be there, and a null pointer was followed. Fix: check the result of relookup(). This fixes PR 1930. Notes: (a) POSIX seems to say that removing link names shall have no effect. BSD (4.4Lite2 at least) does something reasonable instead. (b) The relookup() may find a file unrelated to the original. Removing this isn't correct. Consider 3 existing files A, B and C, and concurrent renames: AB = rename(A, B), another AB, and CA = rename("c", "a"). If rename() is atomic, then only the following results are possible: AB, AB (fails), CA: A = original C, B = original A, C = gone AB, CA, AB: A = gone, B = original C, C = gone CA, AB, AB (fails): A = gone, B = original C, C = gone but ufs_rename() can give: A,AB,CA,B (sorta): A = gone, B = original A, C = gone This usually doesn't matter, since getting into a race is usually an error. --- These fixes should be in 2.1.6 and 2.2.
18899	12-Oct-1996	bde	Fixed lblktosize(). It overflowed at 2G. This bug only affected ufs_read() and ufs_write(). Found by: looking at warnings for comparing the result of lblktosize() (which is usually daddr_t = long) with file sizes (which are u_quad_t for ufs). File sizes should probably be off_t's to avoid warnings when the are compared with file offsets, so the fixed lblktosize() casts to off_t instead of u_quad_t. Added definition of smalllblksize(). It is the same as the old lblksize() and is more efficient for small block numbers on 32-bit machines. Use smalllblktosize() instead of its expansion in blksize() and dblksize(). This keeps the line length short and makes it more obvious that the shift can't overflow.
18429	20-Sep-1996	bde	Don't include <sys/conf.h> for the kernel in disk-related headers. It is needed for implementation details but very little of it is needed for the interface. Include it in the few places that didn't already include it. Include <sys/ioccom.h> in <sys/disklabel.h> (as already in <sys/diskslice.h>) so that all the disk-related headers are almost self-sufficient.
18413	20-Sep-1996	nate	Whoops, I should've used the LINT config file. More ts -> tv changes for timespec structure.
18397	19-Sep-1996	nate	In sys/time.h, struct timespec is defined as: /* * Structure defined by POSIX.4 to be like a timeval. / struct timespec { time_t ts_sec; / seconds / long ts_nsec; / and nanoseconds */ }; The correct names of the fields are tv_sec and tv_nsec. Reminded by: James Drobina <jdrobina@infinet.com>
18330	17-Sep-1996	peter	Argh, I have had one "uid 0 on /: file system full" too many. The problem is that it doesn't say _what_ did it! (the core dumped console message is very useful for listing the process name and pid). This adds similar information.
18104	07-Sep-1996	dyson	Fix a VOP_UNLOCK panic when using options DIAGNOSTIC during dismount.
18069	06-Sep-1996	gibbs	Use bowrite instead of VOP_BWRITE in a few cases. This can probably be taken further.
18020	03-Sep-1996	bde	Eliminated nested include of <sys/unistd.h> in <sys/file.h> in the kernel. Include it directly in the few places where it is used. Reduced some #includes of <sys/file.h> to #includes of <sys/fcntl.h> or nothing.
18006	03-Sep-1996	dg	Implemented kernel side of MNT_NOATIME mount option. This option disables the file access time update on reads and can be useful in reducing filesystem overhead in cases where the access time is not important (like Usenet news spools).
17971	31-Aug-1996	bde	Don't depend in the kernel on the gcc feature of doing arithmetic on pointers of type `void *'. Warn about this in future.
17761	21-Aug-1996	dyson	Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistant and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices is simpler and more correct. This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc. Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility. Reviewed by: davidg
17108	12-Jul-1996	bde	Don't use NULL in non-pointer contexts.
17040	09-Jul-1996	wollman	Quiet a couple of -Wunused warnings.
16681	25-Jun-1996	dg	Fixed end condition for clustered reads. Submitted by: Kirk McKusick via Lite-2 and email
16322	12-Jun-1996	gpalmer	Clean up -Wunused warnings. Reviewed by: bde
16312	12-Jun-1996	dg	Moved the fsnode MALLOC to before the call to getnewvnode() so that the process won't possibly block before filling in the fsnode pointer (v_data) which might be dereferenced during a sync since the vnode is put on the mnt_vnodelist by getnewvnode. Pointed out by Matt Day <mday@artisoft.com>
15680	08-May-1996	gpalmer	Clean up various compiler warnings. Most (if not all) were benign Reviewed by: bde
15576	03-May-1996	phk	disksort() is gone, all drivers now use tqdisksort().
15543	02-May-1996	phk	removed: CLBYTES PD_SHIFT PGSHIFT NBPG PGOFSET CLSIZELOG2 CLSIZE pdei() ptei() kvtopte() ptetov() ispt() ptetoav() &c &c new: NPDEPG Major macro cleanup.
15493	01-May-1996	bde	Removed bogus _BEGIN_DECLS/_END_DECLS. Removed unused struct tag declarations in cloned code. Added or cleaned up idempotency ifdefs.
15315	19-Apr-1996	bde	Yet more b_flags fixes. The previous ones broke the clearing of B_DONE and B_READ before writing. This was was fatal. They also broke the clearing of B_INVAL before doing i/o. This didn't actually matter. Submitted by: mostly by joerg
15140	08-Apr-1996	phk	Replace usage of buf->b_actf by queue.3 and buf->b_act
14909	29-Mar-1996	bde	Fixed reference counting related to relookup(). relookup() must be called with the directory referenced, and this reference will be dropped iff relookup() fails, so the value returned must not be ignored. Reviewed by: davidg
14831	27-Mar-1996	hsu	Make type compatible with Lite2. Submitted by: bde
14345	02-Mar-1996	dyson	Handle the bogus device that MFS uses as its VBLK device. We now don't try to VMIO open it on MFS mounts. This will fix the mfs_badops panic.
14317	02-Mar-1996	dyson	Enable VMIO for non-VDIR metadata and block device.
14315	02-Mar-1996	dyson	More b_flags fixes.
14312	01-Mar-1996	dyson	Fix a bug that b_flags was getting unnecessarily modified by the slice code. The effect up to now has been insignficant, but improved buffer allocation code will break with this problem.
14279	27-Feb-1996	mpp	Add a prototype for the quotactl system call.
14249	25-Feb-1996	bde	Removed vestigial support for the obsolete FIFO option. In ext2fs it caused null pointer panics for all fifo operations unless FIFO was defined.
13765	30-Jan-1996	mpp	Fix a bunch of spelling errors in the comment fields of a bunch of system include files.
13490	19-Jan-1996	dyson	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do alot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
13424	14-Jan-1996	bde	Partially fixed negative and truncated "Avail" counts in df output. This fixes PR943. ffs/ffs_vfsops.c: ffs_statfs() multiplied by (100 - minfree) as part of calculating the minfree percentage (complemented in 100%), so with the standard minfree of 8, it was broken for file systems of size >= 1TB/92 = 11GB. Use the standard freespace() macro instead. This also fixes a rounding bug (the "Avail" count was sometimes 1 too small). ffs/* (not fixed): The freespace() macro multiplies by minfree, so with the standard minfree of 8, it is broken for file systems of size >= 1TB/8 = 128GB. This bug is more serious since it affects block allocation. ffs/ffs_alloc.c (not fixed): Ordinary users are sometimes allowed to allocate 1 (partial) block too many so that the "Avail" count goes negative. E.g., if there is 1 fragment available and the file is fairly large, one more full block is allocated. df/df.c: ufs_df() used/uses essentially the same code as ffs_statfs(), so it had/has the same bugs. ufs_df() gratuitously replaced "Avail" counts of < 0 by 0, so it gave different results for non-mounted file systems in this case.
13309	07-Jan-1996	phk	The second cast wasn't needed. Submitted by: bde
13273	06-Jan-1996	phk	Fix the asami&phk bug. This was a sign-extension bug, where a long got multiplied by a constant before being upgraded to long long. This should fix kern/104 and possibly kern/105. Thanks to: dyson & asami.
13260	05-Jan-1996	wollman	Convert QUOTA to new-style option.
13228	04-Jan-1996	wollman	Convert DDB to new-style option.
13122	30-Dec-1995	peter	recording cvs-1.6 file death
12976	22-Dec-1995	bde	Fixed prototyping and staticizing for -DDEBUG case.
12971	22-Dec-1995	phk	Staticize.
12911	17-Dec-1995	phk	Staticize.
12861	15-Dec-1995	peter	Silence a harmless warning...
12838	14-Dec-1995	bde	Included <sys/conf.h> and updated to indirect devswitches so that this compiles again, and added a prototype.
12825	14-Dec-1995	peter	hack alert! :-) This adds an option to the MFS_ROOT code so that it is possible to boot a kernel with an empty in-core MFS image, and have it load the image from floppy directly. This is admittedly a hack and would be better replaced by a self-loading ram-disk.
12767	11-Dec-1995	dyson	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
12662	07-Dec-1995	dg	Untangled the vm.h include file spaghetti.
12653	06-Dec-1995	bde	Fixed compilation of lfs utilities which I broke the other day by #including lfs_extern.h and goop to support it in lfs_conv.c.
12590	03-Dec-1995	bde	Completed function declarations and/or added prototypes and/or #includes to get the prototypes.
12500	28-Nov-1995	bde	Removed bogus __BEGIN_DECS/__END_DECLS.
12499	28-Nov-1995	peter	After having put on my Asbestos suit, complete the MFS_ROOT part of Terry's mountroot changes. This means that the mfs_initminiroot functionality into the root mfs_mount....
12497	28-Nov-1995	peter	Attempt to solve the busy-buffers-on-shutdown caused by MFS once and for all. What was happening, was that the main mfs loop was sleeping, and when it was being awoken by a wakeup when it was supposed to process some IO requests. The problem was that if it was being woken out of the tsleep() by a signal at shutdown, it was going straight into dounmount() without servicing any pending IO requests, causing dounmount() to fail because there were busy buffers (and they could not be "processed" because the processing loop was trying to unmount rather than dispatching into mfs_doio()). This (dare I say it :-) appears to be a layering problem....
12460	23-Nov-1995	dyson	Update the wd.c driver to use the new TAILQ scheme for device buffer queue. Also, create a new subroutine 'tqdisksort' that is an improved version of the original disksort that also uses TAILQs.
12453	21-Nov-1995	bde	Completed function declarations and/or added prototypes.
12424	20-Nov-1995	phk	Fix compiler warnings.
12405	19-Nov-1995	dyson	General fixes to the vfs clustring code: 1) Make cluster buffer list be a non-malloced chain. This eliminates yet another 'evil' M_WAITOK and generally cleans up the code. 2) Fix write clustering for ext2fs. It was just broken. Also, ffs clustering had an efficiency problem that more bawrites were happening than should have been. 3) Make changes to buf.h to support the above, plus remove b_pfcent at the request of David Greenman. Note that the reallocblocks code is disabled pending rewrite for the cluster buffer list changes.
12399	19-Nov-1995	dyson	Change incorrect '#if EXT2FS' to '#ifdef EXT2FS'
12288	14-Nov-1995	phk	Get rid of the last debug sysctl variables of the old style.
12221	12-Nov-1995	bde	Included <sys/sysproto.h> to get central declarations for syscall args structs and prototypes for syscalls. Ifdefed duplicated decentralized declarations of args structs. It's convenient to have this visible but they are hard to maintain. Some are already different from the central declarations. 4.4lite2 puts them in comments in the function headers but I wanted to avoid the large changes for that.
12158	09-Nov-1995	bde	Introduced a type `vop_t' for vnode operation functions and used it 1138 times (:-() in casts and a few more times in declarations. This change is null for the i386. The type has to be `typedef int vop_t(void *)' and not `typedef int vop_t()' because `gcc -Wstrict-prototypes' warns about the latter. Since vnode op functions are called with args of different (struct pointer) types, neither of these function types is any use for type checking of the arg, so it would be preferable not to use the complete function type, especially since using the complete type requires adding 1138 casts to avoid compiler warnings and another 40+ casts to reverse the function pointer conversions before calling the functions.
12120	06-Nov-1995	dyson	This commit causes UFS to perform at Linux EXT2FS metadata rates. After earlier discussions with DG, and a recent email exchange with SEF, I decided to allow UFS to run wide-open on an experimental basis. We will probably support eventually multiple async modes, and this is the fastest the we can expect. Just use the -o async flag on the UFS mount. Good luck...
12117	05-Nov-1995	dyson	Changes to existing files for ext2fs support. The UFS mods need rework in the future as they are a bit crufty -- but at least the stuff is in the tree now.
12114	05-Nov-1995	dyson	Fix ufs_bmap so that triple indirect blocks might work. Submitted by: Godmar Back <gback@facility.cs.utah.edu>
12111	05-Nov-1995	dyson	Make MNT_ASYNC more effective for UFS. It should not be too much more dangerous than the original MNT_ASYNC. There might be some minor security considerations due to data writes not being posted as promptly as before. Meta-data operations are still not quite as fast as Linux, but streaming I/O is still higher.
11953	31-Oct-1995	peter	mfs_open could panic with false identification: panic("mfs_ioctl: ....
11701	23-Oct-1995	dyson	Finalize GETPAGES layering scheme. Move the device GETPAGES interface into specfs code. No need at this point to modify the PUTPAGES stuff except in the layered-type (NULL/UNION) filesystems.
11644	22-Oct-1995	dg	Moved the filesystem read-only check out of the syscalls and into the filesystem layer, as was done in lite-2. Merged in some other cosmetic changes while I was at it. Rewrote most of msdosfs_access() to be more like ufs_access() and to include the FS read-only check. Obtained from: partially from 4.4BSD-lite2
11297	07-Oct-1995	bde	Return EINVAL instead of panicing for rename("dir1", "dir2/.."). Fixes part of PR 760. This bug seems to be very old.
11264	06-Oct-1995	phk	use roundup2 to avoid a bunch of 64bit divides.
10998	25-Sep-1995	dyson	Re-enable read clustering.
10949	22-Sep-1995	dg	Shit! I changed the wrong doclusterread! ...Thanks to Steven Wallace and Poul-Henning for convincing me that I should look at my mistake! :-)
10946	22-Sep-1995	dg	Disable file read clustering until the bug(s) in vfs_cluster.c are fixed. This should temporarily fix the sig 10/11 problems that people have been having for the past 3 weeks.
10823	16-Sep-1995	bde	Remove transitory labelling code. Labels are now handled by essentially the original 4.4lite code. Machine Specific Partitions are now handled separately.
10675	11-Sep-1995	bde	Fix benign type mismatch in a call to VOP_BMAP().
10646	09-Sep-1995	julian	Obtained from:4.4lite2 fix a change where a shortcut resulted in teh wrong answer.. e.g. touch a touch b mv a b resulted in b being removed and a being moved to b in the shortcut.. touch a ln a b mv a b the wrong link was removed.. leaving a instead of b, giving a different result to when both files were separate.
10632	08-Sep-1995	dg	Slight optimization for the standard case of rotdelay=0.
10597	07-Sep-1995	dyson	Correct a case in the ffs_getpages where a page is not found in a sparse file and the page is zeroed but not set valid, clean.
10578	06-Sep-1995	dyson	Added indirect pointer for ffs_getpages, and added external declaration.
10577	06-Sep-1995	dyson	Added new ffs_getpages routine. It isn't optimized yet, but FFS now does it's own getpage -- instead of using the default routine in vnode_pager.c.
10552	04-Sep-1995	dyson	Correct prototype for ufs_bmaparray()
10551	04-Sep-1995	dyson	Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count for VOP_BMAP. Updated affected filesystems...
10431	30-Aug-1995	bde	Declare vfs_mountroot() in the right place.
10389	28-Aug-1995	bde	Fix correct_writedisklabel() and writedisklabel(). Their setting of bp->b_flags has been broken for many years: a) they didn't set B_BUSY for doing i/o. This has been fatal since 1995/07/25 when biodone() started checking that B_BUSY is set. b) they didn't set B_INVAL for releasing the buffer. This at best just put a useless buffer in the LRU queue for a little while. Fix a couple of spelling errors and complete a couple of function pointer declarations.
10358	28-Aug-1995	julian	Reviewed by: julian with quick glances by bruce and others Submitted by: terry (terry lambert) This is a composite of 3 patch sets submitted by terry. they are: New low-level init code that supports loadbal modules better some cleanups in the namei code to help terry in 16-bit character support some changes to the mount-root code to make it a little more modular.. NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able to test those cases.. certainly mounting root of disk still works just fine.. mfs should work but is untested. (tomorrows task) The low level init stuff includes a total rewrite of init_main.c to make it possible for new modules to have an init phase by simply adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can be added to the kernel without editing any other files other than the 'files' file.
10269	25-Aug-1995	bde	Don't call VOP_UPDATE() with volatile timestamps.
10129	20-Aug-1995	dg	Fixed mfs reboot panic by never returning failure from mfs_start(). Obtained from: 4.4BSD-Lite2
10080	16-Aug-1995	bde	Make everything except the unsupported network sources compile cleanly with -Wnested-externs.
10078	16-Aug-1995	dg	Honor -async mount option when doing the inode update. Obtained from: 4.4BSD-Lite2
10027	11-Aug-1995	dg	Converted mountlist to a CIRCLEQ. Partially obtained from: 4.4BSD-Lite2
9984	07-Aug-1995	dg	On closer inspection, it turns out that all of the callers of disksort are already at splbio()...so back out the last change to disksort.
9982	07-Aug-1995	dg	Since buffers can be pulled off of the disk queue at interrupt time and disksort is called at non-interrupt time and can be actively traversing the list when that happens, there is a very small window of vulnerability. Close it by protecting disksort with splbio().
9980	07-Aug-1995	dg	Use bdwrite() rather than brelse(). The cylinder group bitmap modification is not preserved otherwise. Note that this is a no-op in FreeBSD, however, as we have doreallocblks disabled. Submitted by: Kirk McKusick
9968	06-Aug-1995	dg	Removed redundant call to vm_object_page_clean: this is already handled by vfs_msync().
9967	06-Aug-1995	dg	Removed redundant call to vm_object_page_clean - this is already done in vfs_msync().
9886	04-Aug-1995	dg	Use the correct flags (IO_SYNC -> B_SYNC) when deciding to do a sync or async write in the section that changes the filesize. The bug resulted in the updates always being async. Obtained from: 4.4BSD-Lite2
9842	01-Aug-1995	dg	Removed my special-case hack for VOP_LINK and fixed the problem with the wrong vp's ops vector being used by changing the VOP_LINK's argument order. The special-case hack doesn't go far enough and breaks the generic bypass routine used in some non-leaf filesystems. Pointed out by Kirk McKusick.
9759	29-Jul-1995	bde	Eliminate sloppy common-style declarations. There should be none left for the LINT configuation.
9618	21-Jul-1995	dg	Since ufs_ihashget can block, the lock must be checked for each time the function returns. Also, moved lock into .bss and made minor cosmetic changes. Submitted by: Bruce Evans
9601	21-Jul-1995	dg	Implement a lock in ffs_vget to prevent a race condition where two processes try allocate the same inode/vnode, causing a duplicate. Submitted by: Matt Dillon, slightly reworked by me.
9507	13-Jul-1995	dg	NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!! Much needed overhaul of the VM system. Included in this first round of changes: 1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers". 2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has escentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items. 3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed. 4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug. 5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance. 6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain. 7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance. 8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed. 9) Some almost useless debugging code removed. 10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure escentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology. 11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended. 12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course). 13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non- standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE. 14) (dyson) Sharing maps have been removed. It's marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintain- ability. (As were most all of these changes) TODO: 1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size. 2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness. 3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind. 4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems. 5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
9356	28-Jun-1995	dg	1) Converted v_vmdata to v_object. 2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs after vnode_pager_alloc() calls - the object is already guaranteed to be persistent. 3) Removed some gratuitous casts.
9354	28-Jun-1995	dg	Fixed VOP_LINK argument order botch.
8876	30-May-1995	rgrimes	Remove trailing whitespace.
8831	29-May-1995	phk	Mount MFS as root RW. Remounting doesn't make sense. Reviewed by: davidg
8805	28-May-1995	dg	Kill bogus vnode_pager_setsize(). It was being called at the wrong time and resulted in the object size being too small. This caused bad things to happen later when the file was mapped. Reviewed by: John Dyson
8692	21-May-1995	dg	Changes to fix the following bugs: 1) Files weren't properly synced on filesystems other than UFS. In some cases, this lead to lost data. Most likely would be noticed on NFS. The fix is to make the VM page sync/object_clean general rather than in each filesystem. 2) Mixing regular and mmaped file I/O on NFS was very broken. It caused chunks of files to end up as zeroes rather than the intended contents. The fix was to fix several race conditions and to kludge up the "b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention to page modifications that occurred via the mmapping. Reviewed by: David Greenman Submitted by: John Dyson
8624	19-May-1995	dg	NFS diskless operation was broken because swapdev_vp wasn't initialized. These changes solve the problem in a general way by moving the initialization out of the individual fs_mountroot's and into swaponvp(). Submitted by: Poul-Henning Kamp
8530	15-May-1995	dg	Fixed incompleteness that would allow dirty filesystems to get mounted when the single user shell was terminated. These changes disallow mounting or R/W upgrading filesystems that are dirty unless "-f" (force) option is used with mount. /etc/rc has been modified to abort the startup if one or more non-nfs partitions fail to mount. Reviewed by: Poul-Henning Kamp, Rod Grimes
8529	15-May-1995	dg	From Bruce Evans: I ran into another manifestation of the problem reported in PR 211 and fixed it. Try this: as non-root: cd /tmp; mkdir x y x/z as root: chown root /tmp/x/z as non-root: cd /tmp/x; mv z ../y # EACCES as expected as root: cd /tmp/x; mv z ../y # EINVAL NOT as expected This is because ufs_rename() sets IN_RENAME and fails to clear it. Reviewed by: davidg Submitted by: bde
8456	11-May-1995	rgrimes	Fix -Wformat warnings from LINT kernel.
8210	01-May-1995	dyson	Limit filesize to the amount that the VM system can currently handle (2GB). If this limit is not imposed, then filesystem corruption will ensue when files larger than 2GB are created. This is temporary, and the underlying limitation will be removed later.
8054	25-Apr-1995	phk	Add a printf so we can see where we get our rootfs from.
8053	25-Apr-1995	dyson	Fixed the mmap hang fix previously committed so that it works with options DIAGNOSTIC, and clear up an additional reference count problem.
8041	24-Apr-1995	dyson	Changes to get rid of ufslk2 hangs when doing read/write to/from mmap regions that are in the same file as the read/write.
7876	16-Apr-1995	dg	Make vegetarian and animal rights people happy and use 0xdeadc0de instead of 0xdeadbeef as the 'spare' value.
7752	11-Apr-1995	dg	Handle the "syncing VCHR vnode hang" problem a little differently; just don't lock the vnode - it doesn't appear to ever be necessary for VCHR vnode/inodes. This fixes a bug introduced in the previous commit that caused tty timestamps to act strange (causing 'w' and 'finger' to show the tty wasn't idle when it may have been for hours).
7695	09-Apr-1995	dg	Changes from John Dyson and myself: Fixed remaining known bugs in the buffer IO and VM system. vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR). (various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf(). (various FS sync) Sync out modified vnode_pager backed pages. ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks. vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order. vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previous not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES(). vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted. vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
7430	28-Mar-1995	bde	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) that I didn't notice when I fixed "all" such warnings before.
7399	26-Mar-1995	dg	Removed third arg (vmio) to allocbuf() that was added with the original merged cache changes, and figure it out based on the B_VMIO buffer flag. Fixes a problem where delayed write VMIO buffers would sometimes get recopied into kernel-alloced memory. Submitted by: John Dyson
7170	19-Mar-1995	dg	Removed redundant newlines that were in some panic strings.
7169	19-Mar-1995	dg	Backed out change to panic call: As Chris just pointed out to me, panic() does indeed work like printf(). gdb gets the string untranslated for some reason.
7156	19-Mar-1995	dg	Fix a call to panic: panic doesn't do token substitution on the panic string.
7145	18-Mar-1995	dg	Don't sync the inode date changes of character special devices during the FS sync. The system would appear to hang momentarily if there was a large backlog of I/O. This is because the vnode remains locked during the output - preventing normal character I/O. The problem was exacerbated by the FFS contiguous block allocation fixes and a semi-broken disksort(). The inode/date will still be synced during a normal FS dismount and whenever the inode is changed for other reasons.
7133	18-Mar-1995	dg	Woops, add back that #define...it's used later in the file.
7126	18-Mar-1995	dg	Fixed comments and removed b_cylinder #define.
7125	18-Mar-1995	dg	Integrated change from 1.1.5: Fixed broken disksort to sort by pblkno rather than by cylinder.
7090	16-Mar-1995	bde	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
7018	12-Mar-1995	bde	Finish the previous change. The device name got lost in diskerr().
7006	11-Mar-1995	dg	Removed gratuitous and extremely evil setting of OBJ_INTERNAL. This caused a cascade of problems including kernel memory corruption, file corruption, system hangs, and panics.
6994	10-Mar-1995	dg	Increased default minfree to 8%.
6993	10-Mar-1995	dg	The threshold for switching from time-space and space-time is too small when minfree is 5%...so make it stay at space in this case. Submitted by: Kirk McKusick
6992	10-Mar-1995	dg	Patch to fix quota panic from Mike Karels: allow Q_SYNC regardless of "target" uid, we allow it with -1; fix bug that caused all ops to refer to user quotas, not group. Submitted by: Mike Karels
6875	04-Mar-1995	dg	Removed obsolete vtrace() remnants.
6864	03-Mar-1995	dg	Fixes from John Dyson to work around vnode lock hang. Basically, remove the VOP_BMAP calls, and add one to bdwrite. Submitted by: John Dyson
6769	27-Feb-1995	se	Don't try to make use of useless rotational position optimisation, if all free blocks are in the same bucket (i.e. NRPOS == 1). Else a free block is choosen, possibly from a different cylinder, even if the block succeeding bpref was free ... Submitted by: se
6640	22-Feb-1995	bde	Use dsname() to get consistent names.
6505	16-Feb-1995	bde	Adjust slice names in diskerr() for the rearranged slice numbers. The mapping from numbers to names is messy for backwards compatibility. E.g., for driver "sd", unit "0": slice 0: omit the slice number for compatibility; names are sd0[a-h]. slice 1: omit the partition letter 'c' because the whole disk device shouldn't have anything to do with partitions; sd0 is the only name. slices 2-31: subtract 1 from slice number to compensate for the compatibility slice 0; names are sd0s[1-30][a-h].
6357	14-Feb-1995	phk	YF fix.
6151	03-Feb-1995	dg	Fixed bmap run-length brokeness. Use bmap run-length extension when doing clustered paging. Submitted by: John Dyson
5840	24-Jan-1995	dg	Removed some unused/obsolete code. Submitted by: John Dyson
5455	09-Jan-1995	dg	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
5392	04-Jan-1995	gibbs	Change panic messges that are ffs_blah functions to say they are ffs not ufs functions.
5391	04-Jan-1995	gibbs	LFS stability patches. There is still a problem with directory update ordering that can prove fatal during large batches of deletes, but this is much better than it was. I probably won't be putting much more time into this until Seltzer releases her new version of LFS which has fragment support. This should be availible just before USENIX.
5248	27-Dec-1994	bde	Use the same current time throughout ffs_update(). Update some macro names in comments. Don't use MNT_WAIT for something not related to mounting.
5247	27-Dec-1994	bde	Use the same current time throughout ITIMES(). I want all current timestamps for an atomic operation such as rename() on a local file system to be identical. Uniformize yet another idempotency ifdef. The comment nesting was bogus.
5185	22-Dec-1994	bde	Print `slicename' and not a bogus pointer in diskerr()
5126	16-Dec-1994	bde	Duplicate readdisklabel() and writedisklabel() and remove DOS stuff from from the copies to create correct_readdisklabel() and correct_writedisklabel(). Print the slice number in diskerr() if it is nonzero.
4827	26-Nov-1994	bde	Submitted by: Kirk McKusick Allow chown() to return success if the gid isn't changed even if the gid is not the caller's. Such gids are normal for files created in world-writable directories sucj as /tmp. This "fixes" annoying error messages for mv'ing files created in /tmp to another file system. mv still preserves the foreign gid of /tmp, but now does it silently.
4535	17-Nov-1994	gibbs	John Dyson's patches (and a few from me too) to LFS to use a different buffering scheme and make it more in tune with FreeBSD's vfs_bio implementation. The filesystem seems fairly stable, but I wouldn't recommend it to anyone not willing to experience problems. This is very green code and has the limitation that YOU CAN ONLY HAVE ONE LFS PARTITION MOUNTED AT A TIME. What LFS is good for: Non fsynced writes FASTER THAN FFS Large deletions Increadibly fast Reads are a little bit slower than FFS right now, but that is a factor of how under optimized this code is. LFS should in theory perform at least as well as FFS under fsync (iozone) type loads, and this is what I'm currently working on. Reviewed by: Justin Gibbs Submitted by: John Dyson Obtained from:
4464	14-Nov-1994	bde	Remove unused `struct disklabel' (the declarations that used it went away). Uniformize idempotency ifdef.
4463	14-Nov-1994	bde	Undo a previous change. <sys/disklabel.h> was broken, not these files.
3962	28-Oct-1994	jkh	From: fredriks@mcs.com (Lars Fredriksen) ... It turns out that these files do not include <sys/dkbad.h> before <sys/disklabel.h>. Submitted by: fredriks
3940	27-Oct-1994	jkh	Julian Elischer's disklabel fixes.
3768	22-Oct-1994	dg	Restrict fs_maxfilesize to 2^40, and check against this in ffs_truncate(). This is part of a bug fix from Kirk McKusick to work around problems in FFS related to the blkno of a 64bit offset not fitting into an int. Note the proper solution would be to deal with 64bit block numbers, but doing this would require sweeping changes; some other day perhaps. Submitted by: Marshall Kirk McKusick
3745	21-Oct-1994	wollman	Make my ALLDEVS kernel compile (basically, LINT minus a lot of options). This involves fixing a few things I broke last time.
3653	17-Oct-1994	phk	This basically allows you to stick a disklabel on any partition. For it to be useful, you must stick your disklabel on the partition which starts where the MBR says FreeBSD lives. If you don't do that, you might get a bad day. Oh, that probably also means that putting swap there is a bad idea...
3605	15-Oct-1994	ache	Add back variable declaration removed by wrong previous cleanups
3604	15-Oct-1994	ache	Add back variable declaration removed by wrong prevous cleanups.
3487	10-Oct-1994	phk	Cosmetics. make gcc less noisy. Still some way to go here.
3451	09-Oct-1994	dg	Got rid of map.h. It's a leftover from the rmap code, and we use rlists. Changed swapmap into swaplist.
3427	08-Oct-1994	phk	POSSIBLE BOGUS CODE found, (related to dos-partitions) in ufs_disksubr.c, look for CC_WALL. Cosmetics, a couple of unused vars.
3425	08-Oct-1994	phk	Cosmetics for gcc -Wall. A couple of unused "int i"'s removed and a couple of prototypes added. And the usual () work.
3420	08-Oct-1994	phk	Cosmetics.
3396	06-Oct-1994	dg	Use tsleep() rather than sleep so that 'ps' is more informative about the wait.
3167	28-Sep-1994	dfr	Make NFS ask the filesystems for directory cookies instead of making them itself.
3148	27-Sep-1994	phk	Moved the "relookup" routine into vfs_lookup.c from ufs/ufs/ufs_vnops.c. Several FS's use this, so it doesn't belong in ufs. (unionfs, msdosfs and ufs)
3103	25-Sep-1994	dg	Removed unimplemented subr_rmap.c and unused references to it.
2979	22-Sep-1994	wollman	More loadable VFS changes: - Make a number of filesystems work again when they are statically compiled (blush) - FIFOs are no longer optional; ``options FIFO'' removed from distributed config files.
2967	22-Sep-1994	wollman	Call ffs ``ufs'' for the benefit of poor, confused user-land programs.
2946	21-Sep-1994	wollman	Implemented loadable VFS modules, and made most existing filesystems loadable. (NFS is a notable exception.)
2922	20-Sep-1994	bde	Use `1' for a boolean value instead of something irrelevant (MNT_WAIT) that happens to be nonzero.
2689	12-Sep-1994	dg	Eliminated a whole pile of ancient (we're taking 4.3BSD) VM system related #define constants. Corrected incorrect VM_MAX_KERNEL_ADDRESS. Reviewed by: John Dyson
2460	02-Sep-1994	dg	panic if length is < 0 in ffs_truncate().
2384	29-Aug-1994	dg	"bogus" fixes from 1.1.5 to work around some cache coherency problems.
2177	21-Aug-1994	paul	Made idempotent Reviewed by: Submitted by:
2176	21-Aug-1994	paul	Made idempotent Reviewed by: Submitted by:
2152	20-Aug-1994	dg	Implemented filesystem clean bit via: machdep.c: Changed printf's a little and call vfs_unmountall() if the sync was successful. cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c: Allow dismount of root FS. It is now disallowed at a higher level. vfs_conf.c: Removed unused rootfs global. vfs_subr.c: Added new routines vfs_unmountall and vfs_unmountroot. Filesystems are now dismounted if the machine is properly rebooted. ffs_vfsops.c: Toggle clean bit at the appropriate places. Print warning if an unclean FS is mounted. ffs_vfsops.c, lfs_vfsops.c: Fix bug in selecting proper flags for VOP_CLOSE(). vfs_syscalls.c: Disallow dismounting root FS via umount syscall.
2142	20-Aug-1994	dg	1) cleaned up after Garrett - fixed more redundant declarations, changed use of timeout_t -> timeout_func_t in aha1542 and aha1742 drivers. 2) fix a bug in the portalfs that was uncovered by better prototyping - specifically, the time must be converted from timeval to timespec before storing in va_atime. 3) fixed/added some miscellaneous prototypes
2112	18-Aug-1994	wollman	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
1960	08-Aug-1994	dg	Made lockf advisory locking code generic (rather than ufs specific), and use it in NFS. This is required both for diskless support and for POSIX compliance. Note: the support in NFS is only for the local node. Submitted by: based on work originally done by Yuval Yurom
1937	08-Aug-1994	dg	Changed B_AGE policy to work correctly in a world with relatively large buffer caches. The old policy generally ended up caching nothing.
1826	03-Aug-1994	dg	Changed occurrances of "itrunc" to "ffs_truncate" to make Bruce happy.
1821	02-Aug-1994	dg	Completed (hopefully) the kernel support for old style "fastlinks".
1817	02-Aug-1994	dg	Added $Id$
1549	25-May-1994	rgrimes	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
1542	24-May-1994	rgrimes	This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.