369045 |
18-Jan-2021 |
emaste |
ffs: avoid creating corrupt extattrfile
This is part of r312416 / e6790841f749, suggested by ml@netfence.it, and at least means we will stop creating corrupt extattr that is not handled by some later versions.
PR: 244089
Git Hash: eebccaae36722f62bc8f05e6c71b867d69faca5f Git Author: emaste@FreeBSD.org |
367145 |
29-Oct-2020 |
brooks |
MFC r366911:
vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf. Since this argument is always a pointer use a type of void * and cast to vm_offset_t in vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to take a pointer and check its bounds.)
In no other situation does b_data contain a user pointer and vmapbuf replaces b_data with the actual mapping.
Suggested by: jhb Reviewed by: imp, jhb Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26784 |
362050 |
11-Jun-2020 |
kib |
MFC r361785, r361801 (by mckusick), r361803 (by se), r361814 (by mckusick), r361875 (by mckusick): Fixes for UFS fdatasync(2). |
360030 |
17-Apr-2020 |
kib |
MFC r359766: ufs: apply suspension for non-forced rw unmounts. |
357034 |
23-Jan-2020 |
mckusick |
MFC of 356714
Fix DIRCHG panic |
357030 |
23-Jan-2020 |
mckusick |
MFC of 356739
Optimize quota sync'ing |
356905 |
20-Jan-2020 |
eugen |
MFC r323157 by 323157: fix recovery information with sector sizes up to 64K.
Original commit log:
The new fsck recovery information to enable it to find backup superblocks created in revision 322297 only works on disks with sector sizes up to 4K. This update allows the recovery information to be created by newfs and used by fsck on disks with sector sizes up to 64K. Note that FFS currently limits filesystem to be mounted from disks with up to 8K sectors. Expanding this limitation will be the subject of another commit.
For example, this allows newfs to work on GELI volumes with 8K sectors.
PR: 243413 Approved by: mckusick Relnotes: Yes |
352381 |
16-Sep-2019 |
kib |
MFC r352058: Remove some unneeded vfs_busy() calls in SU code. |
349308 |
23-Jun-2019 |
asomers |
MFC r348251:
Remove "struct ucred*" argument from vtruncbuf
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever since that function was first added in r34611. Remove it. Also, remove some "struct ucred" arguments from fuse and nfs functions that were only used by vtruncbuf.
Reviewed by: cem Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20377 |
346226 |
15-Apr-2019 |
kib |
MFC r346031: Handle races when remounting UFS volume from ro to rw. |
345667 |
29-Mar-2019 |
mckusick |
MFC of 343536, 345077, and 345352
Collectively fixing ffs_truncate3 and dangling dependencies panics when using ACLs.
Sponsored by: Netflix |
344861 |
07-Mar-2019 |
mckusick |
MFC of 344552 and 344732
Have fsck_ffs adjust size for files with hole at end
Tighten last lbn calculation
Sponsored by: Netflix |
344376 |
20-Feb-2019 |
kevans |
MFC r304850, r305480, r324550-r324551, r324655, r324684: correct mis-merge
Some of these commits were improperly MFC'd in the sys/boot => stand mega-MFC, others were simply missed. Correct that mistake now by manually merging the few that were missed and record-only merge on the others.
r304850: Unused variables and cstyle fix for loader dosfs
r305480: Renumber the advertising clause.
r324550: Add $FreeBSD$ to ancient sources that it's missing from.
r324551: Move lib/libstand to sys/boot/libsa
Move the sources to sys/boot. Make adjustments related to the move. Kill LIBSTAND_SRC since it's no longer needed.
r324655: Remove the libstand directory which is now empty.
r324684: Remove lib/libstand again, accidentally readded in r324683 |
342819 |
06-Jan-2019 |
mckusick |
MFC of 342548
When loading an inode from disk, verify that its mode is valid.
Sponsored by: Netflix |
342623 |
30-Dec-2018 |
kib |
MFC r342381: Allocate v_object for the new snapshot vnode. |
339010 |
29-Sep-2018 |
kib |
MFC r338892: Correct panic messages. |
337483 |
08-Aug-2018 |
kib |
MFC r337055: Avoid assertion in /dev/ufssuspend when the suspend ioctl is (incorrectly) called while another suspension is already active.
PR: 230220 |
331722 |
29-Mar-2018 |
eadler |
Revert r330897:
This was intended to be a non-functional change. It wasn't. The commit message was thus wrong. In addition it broke arm, and merged crypto related code.
Revert with prejudice.
This revert skips files touched in r316370 since that commit was since MFCed. This revert also skips files that require $FreeBSD$ property changes.
Thank you to those who helped me get out of this mess including but not limited to gonzo, kevans, rgrimes.
Requested by: gjb (re) |
331017 |
15-Mar-2018 |
kevans |
MFC r317055,r317056 (glebius): Include sys/vmmeter.h as included
r317055: All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.
r317056: Typo! |
330897 |
14-Mar-2018 |
eadler |
Partial merge of the SPDX changes
These changes are incomplete but are making it difficult to determine what other changes can/should be merged.
No objections from: pfg |
330446 |
05-Mar-2018 |
eadler |
MFC r327231,r327232:
kernel: Fix several typos and minor errors lib: Fix several typos and minor errors
- duplicate words - typos - references to old versions of FreeBSD |
329020 |
08-Feb-2018 |
kevans |
MFC r309062: Release laundered vnode pages to the head of the inactive queue.
The swap pager enqueues laundered pages near the head of the inactive queue to avoid another trip through LRU before reclamation. This change adds support for this behaviour to the vnode pager and makes use of it in UFS and ext2fs. Some ioflag handling is consolidated into a common subroutine so that this support can be easily extended to other filesystems which make use of the buffer cache. No changes are needed for ZFS since its putpages routine always undirties the pages before returning, and the laundry thread requeues the pages appropriately in this case. |
328047 |
16-Jan-2018 |
kib |
MFC r327723, r327821: Generalize the fix from r322757 and apply it to several more places. |
328046 |
16-Jan-2018 |
kib |
MFC r327722: When handling write completion, take SU lock around calls to handle_written_XXX() in case of processing the buffer with an error. |
328045 |
16-Jan-2018 |
kib |
MFC r327721: Postpone the disassociation of the background write buffer with devvp so that buf_complete() sees fully constructed buffer. |
326921 |
17-Dec-2017 |
markj |
MFC r326731: Provide a sysctl to force synchronous initialization of inode blocks. |
325402 |
04-Nov-2017 |
markj |
MFC r325051: Remove a stale and incorrect comment. |
325401 |
04-Nov-2017 |
markj |
MFC r325050: Remove workqueue items after updating the workqueue tail pointer. |
325281 |
01-Nov-2017 |
markj |
MFC r324992: Make drain_output() use bufobj_wwait(). |
324612 |
13-Oct-2017 |
jhb |
MFC 324039: Don't defer wakeup()s for completed journal workitems.
Normally wakeups() are performed for completed softupdates work items in workitem_free() before the underlying memory is free()'d. complete_jseg() was clearing the "wakeup needed" flag in work items to defer the wakeup until the end of each loop iteration. However, this resulted in the item being free'd before its address was used with wakeup(). As a result, another part of the kernel could allocate this memory from malloc() and use it as a wait channel for a different "event" with a different lock. This triggered an assertion failure when the lock passed to sleepq_add() did not match the existing lock associated with the sleep queue. Fix this by removing the code to defer the wakeup in complete_jseg() allowing the wakeup to occur slightly earlier in workitem_free() before free() is called. |
323153 |
04-Sep-2017 |
kib |
MFC r322757, r322883: Avoid dereferencing potentially freed workitem in softdep_count_dependencies(). |
322829 |
24-Aug-2017 |
kib |
MFC r322756: Style. |
322806 |
23-Aug-2017 |
mckusick |
MFC of 322200, 322201, 322271, and 322297
322200: Remove (broken) search for alternate superblocks 322201: Show differences when alternate superblock fails to match 322271: Cleanup for 322200. 322297: Restore fsck_ffs ability to find alternate superblocks
Discussed with: kib, imp Differential Revision: https://reviews.freebsd.org/D11589 |
322130 |
07-Aug-2017 |
mckusick |
MFC r321816: Avoid reading a snapshot block when it is already in the cache. |
322045 |
04-Aug-2017 |
kib |
MFC r321349: Improve publication of the newly allocated snapdata. |
322044 |
04-Aug-2017 |
kib |
MFC r321348: Unlock correct lock in ffs_snapblkfree(). |
322043 |
04-Aug-2017 |
kib |
MFC r321347: Account for lock recursion when transferring snaplock to the vnode lock in ffs_snapremove(). |
320057 |
17-Jun-2017 |
kib |
MFC r319539: Mitigate several problems with the softdep_request_cleanup() on busy host.
Approved by: re (gjb) |
319778 |
10-Jun-2017 |
kib |
MFC r319518: Ensure that cached struct thread does not keep spurious td_su reference on an UFS mount point.
MFC r319519: Clean possible td_su reference on the struct mount being unmounted as the last step of ffs_unmount().
Approved by: re (gjb) |
318266 |
14-May-2017 |
kib |
MFC r317908: Remove spl() calls from UFS code. |
315064 |
11-Mar-2017 |
kib |
MFC r314253: Do not leak mount references for dying threads. |
312072 |
13-Jan-2017 |
kib |
MFC r311522: Use type-independent formats for printing nlink_t and ino_t. |
309207 |
27-Nov-2016 |
kib |
MFC r308618: Provide simple mutual exclusion between mount point update and unmount. In the update path in ffs_mount(), drop vfs_busy() reference around namei(). |
309172 |
26-Nov-2016 |
mckusick |
MFC r308064: Avoid possible overflow when calculating malloc size for auxiliary data structure sizes when mounting and reloading UFS/FFS filesystems. |
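The log entry above does not say how the overflow is avoided. As a rough illustration only, one common way to guard an allocation size computed as count times element size is to check the multiplication before performing it. The helper name `ex_checked_mul` is hypothetical and is not taken from the FreeBSD sources.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store n * size in *out.
 * Returns 0 on success, -1 if the product would not fit in size_t.
 */
int
ex_checked_mul(size_t n, size_t size, size_t *out)
{
	/* n * size overflows exactly when n > SIZE_MAX / size (size != 0). */
	if (size != 0 && n > SIZE_MAX / size)
		return (-1);
	*out = n * size;
	return (0);
}
```

A caller would refuse the mount (or reload) when the helper reports failure instead of passing a wrapped-around size to the allocator.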
308554 |
11-Nov-2016 |
kib |
MFC r308026: Generalize UFS buffer pager.
MFC r308442: Tweaks for the buffer pager. |
308208 |
02-Nov-2016 |
kib |
MFC r307626: Add FFS pager, which uses buffer cache read operation to validate pages.
For now, the pager is disabled by default in the stable branch. |
307533 |
17-Oct-2016 |
mckusick |
MFC r304230: Add two new macros, SLIST_CONCAT and LIST_CONCAT.
MFC r304239: Bug 211013 reports that a write error to a UFS filesystem running with softupdates panics the kernel.
PR: 211013 |
306627 |
03-Oct-2016 |
kib |
MFC r305977: Be more strict when selecting between snapshot/regular mount. |
306553 |
01-Oct-2016 |
kib |
MFC r305902: Reduce size of ufs inode.
MFC r305903: Fix libprocstat build after r305902. |
306171 |
22-Sep-2016 |
kib |
MFC r305599: Do not leak transient ENOLCK error from flush_newblk_dep() loop. |
306167 |
22-Sep-2016 |
kib |
MFC r305594: In softdep_prealloc(), return early not only for snapshots, but for the quota files as well. |
306165 |
22-Sep-2016 |
kib |
MFC r305592: Partially lift suspension when ffs_reload() finished with cgs and going to re-read inodes. |
304984 |
29-Aug-2016 |
kib |
MFC r304180: Implement VOP_FDATASYNC() for UFS. |
304983 |
29-Aug-2016 |
kib |
MFC r303924 (by trasz): Eliminate vprint(). |
304666 |
23-Aug-2016 |
kib |
MFC r304232: In UFS_BALLOC(), invalidate pages of indirect buffers on failed block allocation unwinding. |
304665 |
23-Aug-2016 |
kib |
MFC r304231: On unwind after failed block allocation in ffs_balloc_ufs{1,2}, assert that recorded allocated blocks numbers match the physical block numbers of dangling buffers which are released. When finally freeing the blocks during unwind, assert that dangling buffers were not re-allocated. |
304664 |
23-Aug-2016 |
kib |
MFC r304229: When looking up dangling buffers for unwind after failing block allocation in UFS_BALLOC(), there is no need to map them. |
304663 |
23-Aug-2016 |
kib |
MFC r304228: When block allocation fails in UFS_BALLOC(), and the volume does not have SU enabled, there is no point in calling softdep_request_cleanup(). |
304662 |
23-Aug-2016 |
kib |
MFC r304227: In ffs_balloc_ufs{1,2} routines, assert that unwind records do not overflow local arrays. |
302408 |
08-Jul-2016 |
gjb |
Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle. Prune svn:mergeinfo from the new branch, as nothing has been merged here.
Additional commits post-branch will follow.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation |
300423 |
22-May-2016 |
kevlo |
arc4random() returns 0 to (2**32)−1, use an alternative to initialize i_gen if it's zero rather than a divide by 2.
With inputs from delphij, mckusick, rmacklem
Reviewed by: mckusick
|
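The entry above says only that "an alternative" to dividing by 2 is used to keep i_gen non-zero. One plausible shape, shown here purely as an assumption, is to redraw until the value is non-zero, which keeps the full 32-bit range while reserving 0 for "not yet initialized". The random source is passed in as a function pointer so the sketch can be exercised without arc4random(); `pick_nonzero_gen` and `stub_rng` are hypothetical names.

```c
#include <stdint.h>

/* Any 32-bit random source returning 0 .. 2^32 - 1. */
typedef uint32_t (*rng32_fn)(void);

/*
 * Hypothetical helper: draw until the value is non-zero, so that 0 can
 * keep meaning "generation number not yet assigned".
 */
uint32_t
pick_nonzero_gen(rng32_fn rng)
{
	uint32_t gen;

	do {
		gen = rng();
	} while (gen == 0);
	return (gen);
}

/* Deterministic stand-in for a real RNG: yields 0 once, then a fixed value. */
static uint32_t stub_calls;

uint32_t
stub_rng(void)
{
	return (stub_calls++ == 0 ? 0 : 0xdeadbeefu);
}
```

With a real generator the loop almost never repeats, since a zero draw has probability 2^-32.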
300366 |
21-May-2016 |
kib |
Stop dropping and reacquiring Giant around geom calls in UFS.
Sponsored by: The FreeBSD Foundation
|
300364 |
21-May-2016 |
kib |
Improve handling of rdev->si_mountpt on mount and unmount of FFS volumes. Treat the field as a semaphore protecting availability of the device for mounting. Do not access devvp->v_rdev without the vnode lock owned.
Protect change of the devvp->v_bufobj bo_ops vector with the vnode lock.
Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
300084 |
17-May-2016 |
kib |
Do enable io accounting for read-only mounts and mounts which are remounted to writeable after initial read-only. Assign to dev->si_mountpt earlier to account the accesses done at the mount time.
Based on submission by: bde MFC after: 1 week
|
300083 |
17-May-2016 |
kib |
If IO_SYNC was passed to ffs_truncate(), request synchronous inode update from the final ffs_update().
Noted by: bde MFC after: 1 week
|
300030 |
17-May-2016 |
kib |
Fix comments.
Submitted by: bde MFC after: 1 week
|
298804 |
29-Apr-2016 |
pfg |
UFS: spelling fixes on comments.
No functional change.
|
298433 |
21-Apr-2016 |
pfg |
sys: use our roundup2/rounddown2() macros when param.h is available.
rounddown2 tends to produce longer lines than the original code and when the code has a high indentation level it was not really advantageous to do the replacement.
This tries to strike a balance between readability using the macros and flexibility of having the expressions, so not everything is converted.
|
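For readers unfamiliar with the macros named above, the sketch below uses local copies under EX_ names so that it builds outside the kernel. The bodies mirror the power-of-two masking idiom; treat them as an illustration rather than a quotation of sys/param.h. The second argument must be a power of two.

```c
#include <stdint.h>

/* Local stand-ins for rounddown2()/roundup2(); y must be a power of two. */
#define EX_ROUNDDOWN2(x, y)	((x) & ~((y) - 1))
#define EX_ROUNDUP2(x, y)	(((x) + ((y) - 1)) & ~((y) - 1))

/* The open-coded expression of the kind the commit replaces. */
uint64_t
open_coded_roundup(uint64_t off, uint64_t bsize)
{
	return ((off + bsize - 1) & ~(bsize - 1));
}

/* The same computation spelled with the macro. */
uint64_t
macro_roundup(uint64_t off, uint64_t bsize)
{
	return (EX_ROUNDUP2(off, bsize));
}

/* Round down to the start of the containing block. */
uint64_t
macro_rounddown(uint64_t off, uint64_t bsize)
{
	return (EX_ROUNDDOWN2(off, bsize));
}
```

The macro form states the intent by name, at the cost of a slightly longer line in deeply indented code, which is the trade-off the commit message describes.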
297791 |
10-Apr-2016 |
pfg |
ufs: replace 0 with NULL for pointers.
While here also do late initialization of the variables we are changing.
Found with devel/coccinelle.
Reviewed by: mckusick MFC after: 2 weeks
|
297633 |
07-Apr-2016 |
trasz |
Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO.
Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds.
Testing - and review of the code, in particular the VFS and VM parts - is very welcome.
MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080
|
297311 |
27-Mar-2016 |
kib |
Split the global taskqueue used to process all UFS trim completions, into per-mount taskqueue with the private taskqueue processing thread. This allows to drain the taskqueue on unmount, to ensure that all TRIMs are finished before mount structures are freed.
But just draining the taskqueue where TRIM biodone geom-up completions are processed is not enough, since ffs_blkfree(), called by the task, might result in more writes. Count inflight delayed blkfree's and pause() unmount until the counter drains as well.
Reported by: Nick Evans <nevans@talkpoint.com> Tested by: Nick Evans <nevans@talkpoint.com>, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
297206 |
23-Mar-2016 |
kib |
Fix locking mistake in softdep_waitidle(). The surrounding code expects that the loop is always exited with the SU lock owned, even on error.
Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days
|
295950 |
24-Feb-2016 |
mckusick |
The UFS filesystem requires that the last block of a file always be allocated. When shortening the length of a file in which the new end of the file contains a hole, the hole must have a block allocated.
Reported by: Maxim Sobolev Reviewed by: kib Tested by: Peter Holm
|
294983 |
28-Jan-2016 |
trasz |
Remove ffs_mountroot() prototype; seems to be long gone.
MFC after: 1 month Sponsored by: The FreeBSD Foundation
|
294956 |
27-Jan-2016 |
mckusick |
This fixes a bug in UFS2 exported NFS volumes. An NFS client can crash a server that has exported UFS2 by presenting a filehandle with an inode number that references an uninitialized inode in a cylinder group. The problem is that UFS2 only initializes blocks of inodes as they are first allocated and ffs_fhtovp() does not validate that the inode is in a range of inodes that have been initialized. Attempting to read an uninitialized inode gets random data from the disk. When the kernel tries to interpret it as an inode, panics often arise.
Reported by: Christos Zoulas (from NetBSD) Reviewed by: kib
|
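The missing validation described above can be pictured with a small userland sketch. Everything here is simplified and hypothetical: `struct ex_cg`, its `initediblk` field, and `ex_check_ino` are illustrative names, the real code has to read the cylinder group from disk, and returning ESTALE is an assumption based on how stale NFS file handles are normally reported.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified per-cylinder-group summary. */
struct ex_cg {
	uint32_t initediblk;	/* number of inodes initialized so far */
};

/*
 * Return 0 if inode number ino lies in an initialized range,
 * ESTALE otherwise. ipg is the number of inodes per cylinder group.
 */
int
ex_check_ino(uint64_t ino, uint32_t ipg, const struct ex_cg *cgs,
    uint32_t ncg)
{
	uint64_t cg;

	if (ipg == 0 || ino / ipg >= ncg)
		return (ESTALE);	/* beyond the last cylinder group */
	cg = ino / ipg;
	if (ino % ipg >= cgs[cg].initediblk)
		return (ESTALE);	/* inode block never initialized */
	return (0);
}
```

Rejecting the handle before the inode is read means the kernel never tries to interpret uninitialized disk contents as an inode.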
294954 |
27-Jan-2016 |
mckusick |
The bread() function was inconsistent about whether it would return a buffer pointer in the event of an error (for some errors it would return a buffer pointer and for other errors it would not return a buffer pointer). The cluster_read() function was similarly inconsistent.
Clients of these functions were inconsistent in handling errors. Some would assume that no buffer was returned after an error and would thus lose buffers under certain error conditions. Others would assume that brelse() should always be called after an error and would thus panic the system under certain error conditions.
To correct both of these problems with minimal code churn, bread() and cluster_read() now always free the buffer when returning an error thus ensuring that buffers will never be lost. The brelse() routine checks for being passed a NULL buffer pointer and silently returns to avoid panics. Thus both approaches to handling error returns from bread() and cluster_read() will work correctly.
Future code should be written assuming that bread() and cluster_read() will never return a buffer with an error, so should not attempt to brelse() the buffer when an error is returned.
Reviewed by: kib
|
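The calling convention described above can be modelled in userland. The sketch below uses mock stand-ins (`mock_bread`, `mock_brelse`, `struct mock_buf`), which are hypothetical and share only the error contract with the kernel routines: on error no buffer is returned, and releasing a NULL buffer is a no-op.

```c
#include <errno.h>
#include <stdlib.h>

struct mock_buf {
	int data;
};

/* Stand-in for brelse(): a NULL pointer is silently ignored. */
void
mock_brelse(struct mock_buf *bp)
{
	if (bp == NULL)
		return;
	free(bp);
}

/* Stand-in for bread(): on error the buffer is freed here and *bpp is NULL. */
int
mock_bread(int fail, struct mock_buf **bpp)
{
	struct mock_buf *bp;

	*bpp = NULL;
	bp = malloc(sizeof(*bp));
	if (bp == NULL)
		return (ENOMEM);
	if (fail) {
		mock_brelse(bp);
		return (EIO);
	}
	bp->data = 42;
	*bpp = bp;
	return (0);
}

/* Recommended style: do not touch the buffer after an error. */
int
read_new_style(int fail, int *out)
{
	struct mock_buf *bp;
	int error;

	error = mock_bread(fail, &bp);
	if (error != 0)
		return (error);
	*out = bp->data;
	mock_brelse(bp);
	return (0);
}

/* Legacy style: releases unconditionally; harmless because bp is NULL. */
int
read_old_style(int fail, int *out)
{
	struct mock_buf *bp;
	int error;

	error = mock_bread(fail, &bp);
	if (error != 0) {
		mock_brelse(bp);
		return (error);
	}
	*out = bp->data;
	mock_brelse(bp);
	return (0);
}
```

Both callers behave correctly under this contract, which is the point of the change: existing code keeps working, while new code should follow the first pattern.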
292541 |
21-Dec-2015 |
kib |
Recheck curthread->td_su after the VFS_SYNC() call, and re-sync if the ast was rescheduled during VFS_SYNC(). It is possible that enough parallel writes or slow/hung volume result in VFS_SYNC() deferring to the ast flushing of workqueue.
Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
291459 |
29-Nov-2015 |
mckusick |
For performance reasons, it is useful to have a single string used as the name of a filesystem when setting it as the first parameter to the getnewvnode() function. Most filesystems call getnewvnode from just one place so can use a literal string as the first parameter. However, NFS calls getnewvnode from two places, so we create a global constant string that can be used by the two instances. This change also collapses two instances of getnewvnode() in the UFS filesystem to a single call.
Reviewed by: kib Tested by: Peter Holm
|
290047 |
27-Oct-2015 |
kib |
Do not perform read-ahead for BA_CLRBUF request when we are low on memory or when dirty buffer queue is too large.
Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
289405 |
16-Oct-2015 |
imp |
Do not relocate extents to make them contiguous if the underlying drive can do deletions. Ability to do deletions is a strong indication that this optimization will not help performance. It will only generate extra write traffic. These devices are typically flash based and have a limited number of write cycles. In addition, making the file contiguous in LBA space doesn't improve the access times from flash devices because they have no seek time.
Reviewed by: mckusick@
|
288989 |
07-Oct-2015 |
glebius |
In softdep_setup_freeblocks(): - Move the bread() to the beginning of function. - Return if it fails, otherwise we will panic.
Submitted by: mckusick Sponsored by: Netflix
|
287483 |
05-Sep-2015 |
kib |
Do not consume extra reference. This is a bug in r287479.
Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
287479 |
05-Sep-2015 |
kib |
Declare the writes around the call to VFS_SYNC() in softdep_ast_cleanup_proc().
Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
287361 |
01-Sep-2015 |
kib |
By doing file extension fast, it is possible to create excess supply of the D_NEWBLK kinds of dependencies (i.e. D_ALLOCDIRECT and D_ALLOCINDIR), which can exhaust kmem.
Handle excess of D_NEWBLK in the same way as excess of D_INODEDEP and D_DIRREM, by scheduling ast to flush dependencies, after the thread, which created new dep, left the VFS/FFS innards. For D_NEWBLK, the only way to get rid of them is to do full sync, since items are attached to data blocks of arbitrary vnodes. The check for D_NEWBLK excess in softdep_ast_cleanup_proc() is unlocked.
For 32bit arches, reduce the total amount of allowed dependencies by two. It could be considered increasing the limit for 64 bit platforms with direct maps.
Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
285993 |
29-Jul-2015 |
jeff |
- Make 'struct buf *buf' private to vfs_bio.c. Having a global variable 'buf' is inconvenient and has led me to some irritating to discover bugs over the years. It also makes it more challenging to refactor the buf allocation system. - Move swbuf and declare it as an extern in vfs_bio.c. This is still not perfect but better than it was before. - Eliminate the unused ffs function that relied on knowledge of the buf array. - Move the shutdown code that iterates over the buf array into vfs_bio.c.
Reviewed by: kib Sponsored by: EMC / Isilon Storage Division
|
285819 |
23-Jul-2015 |
jeff |
Refactor unmapped buffer address handling. - Use pointer assignment rather than a combination of pointers and flags to switch buffers between unmapped and mapped. This eliminates multiple flags and generally simplifies the logic. - Eliminate b_saveaddr since it is only used with pager bufs which have their b_data re-initialized on each allocation. - Gather up some convenience routines in the buffer cache for manipulating buf space and buf malloc space. - Add an inline, buf_mapped(), to standardize checks around unmapped buffers.
In collaboration with: mlaier Reviewed by: kib Tested by: pho (many small revisions ago) Sponsored by: EMC / Isilon Storage Division
|
285390 |
11-Jul-2015 |
mjg |
Move chdir/chroot-related fdp manipulation to kern_descrip.c
Prefix exported functions with pwd_.
Deduplicate some code by adding a helper for setting fd_cdir.
Reviewed by: kib
|
285182 |
05-Jul-2015 |
markj |
Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT. This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough filesystems like nullfs and unionfs no longer need to inherit this information from their lower layer(s). This change also restores the pre-r273336 behaviour of using the presence of a susp_clean VFS method to request suspension support.
Reviewed by: kib, mjg Differential Revision: https://reviews.freebsd.org/D2937
|
284959 |
30-Jun-2015 |
markm |
Huge cleanup of random(4) code.
* GENERAL - Update copyright. - Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set neither to ON, which means we want Fortuna - If there is no 'device random' in the kernel, there will be NO random(4) device in the kernel, and the KERN_ARND sysctl will return nothing. With RANDOM_DUMMY there will be a random(4) that always blocks. - Repair kern.arandom (KERN_ARND sysctl). The old version went through arc4random(9) and was a bit weird. - Adjust arc4random stirring a bit - the existing code looks a little suspect. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Redo read_random(9) so as to duplicate random(4)'s read internals. This makes it a first-class citizen rather than a hack. - Move stuff out of locked regions when it does not need to be there. - Trim RANDOM_DEBUG printfs. Some are excess to requirement, some behind boot verbose. - Use SYSINIT to sequence the startup. - Fix init/deinit sysctl stuff. - Make relevant sysctls also tunables. - Add different harvesting "styles" to allow for different requirements (direct, queue, fast). - Add harvesting of FFS atime events. This needs to be checked for weighing down the FS code. - Add harvesting of slab allocator events. This needs to be checked for weighing down the allocator code. - Fix the random(9) manpage. - Loadable modules are not present for now. These will be re-engineered when the dust settles. - Use macros for locks. - Fix comments.
* src/share/man/... - Update the man pages.
* src/etc/... - The startup/shutdown work is done in D2924.
* src/UPDATING - Add UPDATING announcement.
* src/sys/dev/random/build.sh - Add copyright. - Add libz for unit tests.
* src/sys/dev/random/dummy.c - Remove; no longer needed. Functionality incorporated into randomdev.*.
* live_entropy_sources.c live_entropy_sources.h - Remove; content moved. - move content to randomdev.[ch] and optimise.
* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h - Remove; plugability is no longer used. Compile-time algorithm selection is the way to go.
* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h - Add early (re)boot-time randomness caching.
* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h - Remove; no longer needed.
* src/sys/dev/random/uint128.h - Provide a fake uint128_t; if a real one ever arrived, we can use that instead. All that is needed here is N=0, N++, N==0, and some localised trickery is used to manufacture a 128-bit 0ULLL.
* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h - Improve unit tests; previously the testing human needed clairvoyance; now the test will do a basic check of compressibility. Clairvoyant talent is still a good idea. - This is still a long way off a proper unit test.
* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct fortuna_start_cache'. - Tighten up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h - Improve messy union to just uint128_t. - Remove unneeded 'static struct start_cache'. - Tighten up arithmetic. - Provide a method to allow eternal junk to be introduced; harden it against blatant by compress/hashing. - Assert that locks are held correctly. - Fix the nasty pre- and post-read overloading by providing explicit functions to do these tasks. - Turn into self-sufficient module (no longer requires randomdev_soft.[ch]) - Fix some magic numbers elsewhere used as FAST and SLOW.
Differential Revision: https://reviews.freebsd.org/D2025 Reviewed by: vsevolod,delphij,rwatson,trasz,jmg Approved by: so (delphij)
|
284927 |
29-Jun-2015 |
kib |
Simplify code, no need to test the flag before clearing it.
Submitted by: ed MFC after: 12 days
|
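The simplification noted above relies on the fact that clearing a bit is idempotent, so testing it first adds nothing. A minimal sketch, with a hypothetical flag name (`EX_FLAG_BUSY`) that is not taken from the commit:

```c
#include <stdint.h>

#define EX_FLAG_BUSY	0x04u	/* hypothetical flag bit */

/* Before: test the flag, then clear it. */
uint32_t
clear_with_test(uint32_t flags)
{
	if ((flags & EX_FLAG_BUSY) != 0)
		flags &= ~EX_FLAG_BUSY;
	return (flags);
}

/* After: clear unconditionally; the result is identical. */
uint32_t
clear_directly(uint32_t flags)
{
	return (flags & ~EX_FLAG_BUSY);
}
```

The two functions return the same value for every input, so the branch can be dropped without changing behavior.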
284887 |
27-Jun-2015 |
kib |
Handle errors from background write of the cylinder group blocks.
First, on the write error, bufdone() call from ffs_backgroundwrite() panics because pbrelvp() cleared bp->b_bufobj, while brelse() would try to re-dirty the copy of the cg buffer. Handle this by setting B_INVAL for the case of BIO_ERROR.
Second, we must re-dirty the real buffer containing the cylinder group block data when background write failed. Real cg buffer was already marked clean in ffs_bufwrite(). After the BV_BKGRDINPROG flag is cleared on the real cg buffer in ffs_backgroundwrite(), buffer scan may reuse the buffer at any moment. The result is lost write, and if the write error was only transient, we get corrupted bitmaps.
We cannot re-dirty the original cg buffer in the ffs_backgroundwritedone(), since the context is not sleepable, preventing us from sleeping for origbp's lock. Add BV_BKGDERR flag (protected by the buffer object lock), which is converted into delayed write by brelse(), bqrelse() and buffer scan.
In collaboration with: Conrad Meyer <cse.cem@gmail.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation (kib), EMC/Isilon storage division (Conrad) MFC after: 2 weeks
|
284495 |
17-Jun-2015 |
kib |
vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages.
The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk no later than 30 seconds after the write.
Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call.
Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
284446 |
16-Jun-2015 |
mjg |
Replace struct filedesc argument in getvnode with struct thread
This is a step towards removal of spurious arguments.
|
283968 |
03-Jun-2015 |
kib |
Syncing a directory vnode might drop the vnode lock in the softdep_sync() similarly to the regular vnode sync. Allow retry for both vnode types.
Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
283832 |
31-May-2015 |
kib |
Remove unused variable.
When deallocate_dependencies() is performed, softdep_journal_freeblocks() already called cancel_allocdirect() which should have eliminated direct dependencies for all truncated full blocks. The indirect dependencies are allowed above, since second- and third-level dependencies are only dealt with by the code which frees indirect block, which happens after the inode write.
Discussed with: mckusick, jeff Reviewed by: jeff Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
283735 |
29-May-2015 |
kib |
Remove several write-only variables, all reported by the gcc 4.9 buildkernel run.
Some of them were write-only under some kernel options, e.g. variables keeping values only used by CTR() macros. It costs nothing to the code readability and correctness to eliminate the warnings in those cases too by removing the local cached values used only for single-access.
Review: https://reviews.freebsd.org/D2665 Reviewed by: rodrigc Looked at by: bjk Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
283604 |
27-May-2015 |
kib |
After r283600, NODELAY flag to inodedep_lookup() function is unused. Eliminate it, and simplify code by removing the local dflags variable always initialized to DEPALLOC.
Noted by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
283600 |
27-May-2015 |
kib |
Currently, softupdate code detects overstepping on the workitems limits in the code which is deep in the call stack, and owns several critical system resources, like vnode locks. Attempt to wait while the per-mount softupdate thread cleans up the backlog may deadlock, because the thread might need to lock the same vnode which is owned by the waiting thread.
Instead of synchronously waiting for the worker, perform the worker's tickle and pause until the backlog is cleaned, at the safe point during return from kernel to usermode. A new ast request to call softdep_ast_cleanup() is created, the SU code now only checks the size of queue and schedules ast.
There is no ast delivery for the kernel threads, so they are exempted from the mechanism, except NFS daemon threads. NFS server loop explicitly checks for the request, and informs the schedule_cleanup() that it is capable of handling the requests by the process P2_AST_SU flag. This is needed because nfsd may be the sole cause of the SU workqueue overflow. But, to not cause nfsd to spawn additional threads just because we slow down existing workers, only tickle su threads, without waiting for the backlog cleanup.
Reviewed by: jhb, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
281956 |
24-Apr-2015 |
mckusick |
Limit the number of cylinder groups that will be searched when trying to build a cluster. The limit is tunable using the sysctl vfs.ffs.maxclustersearch. The current limit is 10 cylinder groups per block allocation. It was previously limited to the number of cylinder groups in the filesystem per block allocation. When there were no clusters of the needed size left, it repeatedly searched the whole filesystem for a non-existent cluster on every block allocation. The result was very slow filesystem allocation with 100% CPU utilization. The old behavior can be had by setting vfs.ffs.maxclustersearch to a huge number (1,000,000).
This change affects only the layout policy routines so is not able to interfere with the integrity of the filesystem.
Reported by: Dmitry Sivachenko (demon@) Tested by: Dmitry Sivachenko (demon@) MFC after: 2 weeks
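The bounded search described in this entry can be modeled in userspace roughly as follows. This is a simplified sketch, not the kernel code: find_cluster_cg(), the has_cluster array, and the parameter names are hypothetical. Only the idea of capping the number of cylinder groups examined per allocation (10 by default, per vfs.ffs.maxclustersearch) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: has_cluster[i] says whether cylinder group i
 * still holds a free cluster of the needed size.  Returns the index
 * of the first suitable group, or -1 once max_search groups have
 * been examined without success.
 */
static int
find_cluster_cg(const bool *has_cluster, size_t ncg, size_t start_cg,
    size_t max_search)
{
	size_t i, cg;

	assert(ncg > 0 && start_cg < ncg);
	/* Never examine more groups than exist or than the limit allows. */
	if (max_search > ncg)
		max_search = ncg;
	for (i = 0; i < max_search; i++) {
		cg = (start_cg + i) % ncg;
		if (has_cluster[cg])
			return ((int)cg);
	}
	return (-1);
}
```

With a limit of 10 the search stops early even when a cluster exists farther away; passing a very large limit reproduces the old whole-filesystem scan.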
|
281562 |
15-Apr-2015 |
rmacklem |
File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache.
Reviewed by: kib MFC after: 2 weeks
|
280763 |
27-Mar-2015 |
kib |
Fix build (with gcc).
Reported by: bz, ian Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
280760 |
27-Mar-2015 |
kib |
Fix the hang after the immediate reboot when the following command sequence is performed on UFS SU+J rootfs:
cp -Rp /sbin/init /sbin/init.old
mv -f /sbin/init.old /sbin/init
Hang occurs on the rootfs unmount. There are two issues:
1. Removed init binary, which is still mapped, creates a reference to the removed vnode. The inodeblock for such vnode must have active inodedep, which is (eventually) linked through the unlinked list. This means that ffs_sync(MNT_SUSPEND) cannot succeed, because number of softdep workitems for the mp is always > 0. FFS is suspended during unmount, so unmount just hangs.
2. As noted above, the inodedep is linked eventually. It is not linked until the superblock is written. But at the vfs_unmountall() time, when the rootfs is unmounted, the call is made to ffs_unmount()->ffs_sync() before vflush(), and ffs_sync() only calls ffs_sbupdate() after all workitems are flushed. It is masked for normal system operations, because syncer works in parallel and eventually flushes superblock. Syncer is stopped when rootfs is unmounted, so ffs_sync() must do sb update on its own.
Correct the issues listed above. For MNT_SUSPEND, count the number of linked unlinked inodedeps (this is not a typo) and subtract the count of such workitems from the total. For the second issue, the ffs_sbupdate() is called right after device sync in ffs_sync() loop.
There is a third problem, occurring with both SU and SU+J. The softdep_waitidle() loop, which waits for softdep_flush() thread to clear the worklist, only waits 20ms max. It seems that the 1 tick, specified for msleep(9), was a typo.
Add fsync(devvp, MNT_WAIT) call to softdep_waitidle(), which seems to significantly help the softdep thread, and change the MNT_LAZY update at the reboot time to MNT_WAIT for similar reasons. Note that userspace cannot create more work while devvp is flushed, since the mount point is always suspended before the call to softdep_waitidle() in unmount or remount path.
PR: 195458 In collaboration with: gjb, pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
278257 |
05-Feb-2015 |
kib |
Partially revert r277922: avoid sleeping and do the flush if we are awakened, instead of waiting for the FLUSH_* flags. Also, when requesting a flush, do the wakeups unconditionally even when the FLUSH_CLEANUP flag was already set.
Reported and tested by: dim, "Lundberg, Johannes" <johannes@brilliantservice.co.jp> Bisected by: dim MFC after: 2 weeks
|
277922 |
30-Jan-2015 |
kib |
When mounting an SU-enabled mount point, wait until the softdep_flush() thread has started and incremented the stat_flush_threads [1].
Unconditionally wake up softdep_flush threads when needed; do not try to check wchan, which is racy and breaks abstraction.
Reported by and discussed with: glebius, neel Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
275897 |
18-Dec-2014 |
kib |
The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache thrashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created.
Right now, each VOP_LOOKUP() implementation explicitly knows about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls.
Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY.
Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
275856 |
17-Dec-2014 |
gleb |
Adjust printf format specifiers for dev_t and ino_t in kernel.
ino_t and dev_t are about to become uint64_t.
Reviewed by: kib, mckusick
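One common way to keep format strings correct regardless of how wide dev_t and ino_t are is to cast to uintmax_t and print with %ju. The sketch below is a userspace illustration of that idiom under that assumption; format_dev_ino() is a hypothetical helper and is not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Format a device/inode pair without assuming the width of either
 * type: the casts make the arguments match %ju on every platform.
 */
static int
format_dev_ino(char *buf, size_t len, dev_t dev, ino_t ino)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len, "dev %ju ino %ju",
	    (uintmax_t)dev, (uintmax_t)ino));
}
```

The same call keeps working unchanged if either type later grows to 64 bits, which is the situation the commit prepares for.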
|
274914 |
23-Nov-2014 |
glebius |
Merge from projects/sendfile:
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but doesn't sleep. It returns immediately, and will execute the I/O done handler function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this method for vnode_pager.
Reviewed by: kib Tested by: pho Sponsored by: Netflix Sponsored by: Nginx, Inc.
|
274906 |
23-Nov-2014 |
glebius |
Include required files directly instead of pollution via ufs/ufsmount.h.
Sponsored by: Nginx, Inc.
|
273967 |
02-Nov-2014 |
kib |
When non-forced unmount or remount rw->ro is performed, writes on UFS are not suspended. In particular, on the SU-enabled volumes, there is no reason why, between the call to softdep_flushfiles() and softdep_waitidle(), SU work items cannot be queued.
Correct the condition to trigger the panic by only checking when forced operation is done. Convert direct panic() call into KASSERT(), there is no invalid on-disk data structures directly involved, so follow the usual debugging vs. non-debugging approach.
Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
273336 |
20-Oct-2014 |
mjg |
Provide vfs suspension support only for filesystems which need it, take two.
nullfs and unionfs need to request suspension if underlying filesystem(s) use it. Utilize mnt_kern_flag for this purpose.
This is a fixup for 273271.
No strong objections from: kib Pointy hat to: mjg MFC after: 2 weeks
|
272952 |
11-Oct-2014 |
kib |
Do not set IN_ACCESS flag for read-only mounts. The IN_ACCESS survives remount in rw, also it is set for vnodes on rootfs before noatime can be set or clock is adjusted. All conditions result in wrong atime for accessed vnodes.
Submitted by: bde MFC after: 1 week
|
271619 |
15-Sep-2014 |
kib |
Provide the unique implementation for the VOP_GETPAGES() method used by ffs and ext2fs. Remove duplicated call to vm_page_zero_invalid(), done by VOP and by vm_pager_getpages(). Use vm_pager_free_nonreq().
Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 6 weeks (after r271596)
|
271540 |
13-Sep-2014 |
alc |
We don't need an exclusive object lock on the expected execution path through {ext2,ffs}_getpages().
Reviewed by: kib, pfg MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division
|
270203 |
20-Aug-2014 |
kib |
Correct the test for the condition to suspend a UFS filesystem during unmount. There is no need to suspend a read-only filesystem, while we need suspension on a modifiable mount point.
Reported by: rwatson Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
269853 |
12-Aug-2014 |
kib |
Revision r269457 removed the Giant around mount and unmount code, but r269533, which was tested before r269457 was committed, implicitly relied on the Giant to protect the manipulations of the softdepmounts list. Use softdep global lock consistently to guarantee the list structure now.
Insert the new struct mount_softdeps into the softdepmounts only after it is sufficiently initialized, to prevent softdep_speedup() from accessing bare memory. Similarly, remove struct mount_softdeps for the unmounted filesystem from the tailq before destroying structure rwlock.
Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
269674 |
07-Aug-2014 |
mckusick |
The SUJ journal is only prepared to handle full-size block numbers, so we have to adjust freeblk records to reflect the change to a full-size block. For example, suppose we have a block made up of fragments 8-15 and want to free its last two fragments. We are given a request that says: FREEBLK ino=5, blkno=14, lbn=0, frags=2, oldfrags=0 where frags are the number of fragments to free and oldfrags are the number of fragments to keep. To block align it, we have to change it to have a valid full-size blkno, so it becomes: FREEBLK ino=5, blkno=8, lbn=0, frags=2, oldfrags=6
Submitted by: Mikihito Takehara Tested by: Mikihito Takehara Reviewed by: Jeff Roberson MFC after: 1 week
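The block alignment in the FREEBLK example above can be sketched as a small userspace function. This is a model under stated assumptions, not the kernel change itself: struct freeblk_rec, freeblk_align(), and frags_per_block are hypothetical names, and the record is assumed to start with oldfrags equal to 0 as in the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a journal FREEBLK record. */
struct freeblk_rec {
	int64_t	blkno;		/* first fragment named by the record */
	int	frags;		/* number of fragments to free */
	int	oldfrags;	/* number of fragments to keep */
};

/*
 * Rewrite a record so that blkno names the start of the enclosing
 * full-size block.  frags_per_block stands in for the filesystem's
 * fragments-per-block value (8 in the example above).
 */
static void
freeblk_align(struct freeblk_rec *r, int frags_per_block)
{
	int off;

	assert(frags_per_block > 0 && r->blkno >= 0);
	off = (int)(r->blkno % frags_per_block);
	/* The freed range must not run past the end of the block. */
	assert(off + r->frags <= frags_per_block);
	r->blkno -= off;	/* 14 becomes 8 in the example */
	r->oldfrags = off;	/* the 6 leading fragments are kept */
}
```

The number of fragments to free is left untouched; only the starting block number and the count of retained fragments change.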
|
269533 |
04-Aug-2014 |
mckusick |
Add support for multi-threading of soft updates.
Replace a single soft updates thread with a thread per FFS-filesystem mount point. The threads are associated with the bufdaemon process.
Reviewed by: kib Tested by: Peter Holm and Scott Long MFC after: 2 weeks Sponsored by: Netflix
|
268612 |
14-Jul-2014 |
kib |
Extract the code to put a filesystem into the suspended state (at the unmount time) in the helper vfs_write_suspend_umnt(). Use it instead of two inline copies in FFS.
Fix a bug in the FFS unmount: when suspension failed, the ufs extattrs were not reinitialized.
Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
267226 |
08-Jun-2014 |
kib |
Initialize the pbuf counter for directio using SYSINIT, instead of using a direct hook called from kern_vfs_bio_buffer_alloc(). Mark ffs_rawread.c as requiring both ffs and directio options to be compiled into the kernel. Add ffs_rawread.c to the list of the ufs.ko module's sources.
In addition to removing the layering violation, this also allows the kernel to link when FFS is configured as a module and DIRECTIO is enabled.
One consequence of the change is that ffs_rawread.o is always linked into the module regardless of the DIRECTIO option. This is similar to the option QUOTA and ufs_quota.c.
Sponsored by: The FreeBSD Foundation MFC after: 1 week
|
267031 |
03-Jun-2014 |
jmg |
don't check fs_flags for _FLAGS_UPDATED as it is stored in fs_old_flags. If you had a UFS2 FS that didn't have its super block at SBLOCK_UFS2, you'll end up corrupting your FS as the superblock is updated and written to a different location...
makefs used to put the superblock at SBLOCK_UFS1 for UFS 2 FS's causing this issue...
Reviewed by: silence from mckusick MFC after: 1 week
|
265463 |
06-May-2014 |
scottl |
Due to reasons unknown at this time, the system can be forced to write a journal block even when there are no journal entries to be written. Until the root cause is found, handle this case by ensuring that a valid journal segment is always written.
Second, the data buffer used for writing journal entries was never being scrubbed of old data. Fix this.
Submitted by: Takehara Mikihito Obtained from: Netflix, Inc. MFC after: 3 days
|
263628 |
22-Mar-2014 |
mckusick |
Update comment to explain search order reverted to historical order in -r254996.
Suggested by: Pedro Giffuni <pfg@FreeBSD.org> MFC: 3 days
|
263233 |
16-Mar-2014 |
rwatson |
Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h.
MFC after: 3 weeks
|
262814 |
06-Mar-2014 |
jeff |
- If we fail to do a non-blocking acquire of a buf lock while doing a waiting sync pass we need to do a blocking acquire and restart. Another thread, typically the buf daemon, may have this buf locked and if we don't wait we can fail to sync the file. This led to a great variety of softdep panics because we rely on all dependencies being flushed before proceeding in several cases.
Reported by: pho Discussed with: mckusick Sponsored by: EMC / Isilon Storage Division MFC after: 2 weeks
|
262678 |
02-Mar-2014 |
pfg |
ufs: small formatting fixes.
Cleanup some extra space. Use of tabs vs. spaces. No functional change.
MFC after: 3 days Reviewed by: mckusick
|
260088 |
30-Dec-2013 |
mckusick |
Fine tune filesystem block allocations under low free-space conditions (-r254995) based on further operational experience.
Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko MFC after: 2 weeks
|
258789 |
01-Dec-2013 |
mckusick |
We needlessly panic when trying to flush MKDIR_PARENT dependencies. We had previously tried to flush all MKDIR_PARENT dependencies (and all the NEWBLOCK pagedeps) by calling ffs_update(). However this will only resolve these dependencies in direct blocks. So very large directories with MKDIR_PARENT dependencies in indirect blocks had not yet gotten flushed. As the directory is in the midst of doing a complete sync, we simply defer the checking of the MKDIR_PARENT dependencies until the indirect blocks have been sync'ed.
Reported by: Shawn Wallbridge of imaginaryforces.com Tested by: John-Mark Gurney <jmg@funkthat.com> PR: 183424 MFC after: 2 weeks
|
258403 |
20-Nov-2013 |
jmg |
fix white space...
MFC after: 1 week
|
258402 |
20-Nov-2013 |
jmg |
fix a use after free, jsegdep_merge will free wk, avoid the next check...
CID: 1006098 Sponsored by: Imaginary Forces Reviewed by: mckusick MFC after: 1 week
|
257029 |
24-Oct-2013 |
pfg |
UFS2: make di_extsize unsigned.
di_extsize is the EA size and as such it should be unsigned. Adjust related types for consistency.
Reviewed by: mckusick (previous version) MFC after: 3 weeks
|
256860 |
21-Oct-2013 |
brooks |
Allow kernels without options SOFTUPDATES to build. This should fix the embedded tinderboxes.
Reviewed by: emaste
|
256845 |
21-Oct-2013 |
mckusick |
Fix build problem on ARM (which defaults to building without soft updates).
Reported by: Tinderbox Sponsored by: Netflix
|
256817 |
21-Oct-2013 |
mckusick |
Restructuring of the soft updates code to set it up so that the single kernel-wide soft update lock can be replaced with a per-filesystem soft-updates lock. This per-filesystem lock will allow each filesystem to have its own soft-updates flushing thread rather than being limited to a single soft-updates flushing thread for the entire kernel.
Move soft update variables out of the ufsmount structure and into their own mount_softdeps structure referenced by ufsmount field um_softdep. Eventually the per-filesystem lock will be in this structure. For now there is simply a pointer to the kernel-wide soft updates lock.
Change all instances of ACQUIRE_LOCK and FREE_LOCK to pass the lock pointer in the mount_softdeps structure instead of a pointer to the kernel-wide soft-updates lock.
Replace the five hash tables used by soft updates with per-filesystem copies of these tables allocated in the mount_softdeps structure.
Several functions that flush dependencies when too many are allocated in the kernel used to operate across all filesystems. They are now parameterized to flush dependencies from a specified filesystem. For now, we stick with the round-robin flushing strategy when the kernel as a whole has too many dependencies allocated.
While there are many lines of changes, there should be no functional change in the operation of soft updates.
Tested by: Peter Holm and Scott Long Sponsored by: Netflix
|
256812 |
20-Oct-2013 |
mckusick |
Fourth of several cleanups to soft dependency implementation. Add KASSERTS that soft dependency functions only get called for filesystems running with soft dependencies. Calling these functions when soft updates are not compiled into the system now results in a panic.
No functional change.
Tested by: Peter Holm and Scott Long Sponsored by: Netflix
|
256808 |
20-Oct-2013 |
mckusick |
Third of several cleanups to soft dependency implementation. Ensure that softdep_unmount() and softdep_setup_sbupdate() only get called for filesystems running with soft dependencies.
No functional change.
Tested by: Peter Holm and Scott Long Sponsored by: Netflix
|
256803 |
20-Oct-2013 |
mckusick |
Second of several cleanups to soft dependency implementation. Delete two unused functions in ffs_softdep.c.
No functional change.
Tested by: Peter Holm and Scott Long Sponsored by: Netflix
|
256801 |
20-Oct-2013 |
mckusick |
First of several cleanups to soft dependency implementation. Convert three functions exported from ffs_softdep.c to static functions as they are not used outside of ffs_softdep.c.
No functional change.
Tested by: Peter Holm and Scott Long Sponsored by: Netflix
|
255219 |
05-Sep-2013 |
pjd |
Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; };
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements.
The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine a few rights, but the rights have to belong to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
	__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belong to the same array element this way is correct, but is not advised. It should only be used for alias definitions.
This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
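The bit layout described in this entry (two version bits, then five index bits, leaving 64 - 2 - 5 = 57 bits per element, hence 2 x 57 = 114 and 5 x 57 = 285 rights) can be illustrated with a small sketch. The CAPRIGHT() expansion below is an assumption consistent with that description, not a quotation of the header, and right_index_bits(), right_is_single_element(), and right_to_index() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 62-63 hold the version, bits 57-61 the index bit. */
#define	CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define	CAP_LOOKUP		CAPRIGHT(0, 0x0000000000000400ULL)
#define	CAP_FCHMOD		CAPRIGHT(0, 0x0000000000002000ULL)
#define	CAP_PDKILL		CAPRIGHT(1, 0x0000000000000800ULL)

/* Hypothetical helper: extract the five-bit index field. */
static unsigned
right_index_bits(uint64_t right)
{
	return ((unsigned)((right >> 57) & 0x1f));
}

/* Hypothetical helper: true if exactly one index bit is set. */
static bool
right_is_single_element(uint64_t right)
{
	unsigned bits = right_index_bits(right);

	return (bits != 0 && (bits & (bits - 1)) == 0);
}

/* Hypothetical helper: map a valid right to its array index. */
static int
right_to_index(uint64_t right)
{
	unsigned bits = right_index_bits(right);
	int idx = 0;

	assert(right_is_single_element(right));
	while ((bits >>= 1) != 0)
		idx++;
	return (idx);
}
```

OR-ing CAP_LOOKUP with CAP_FCHMOD leaves a single index bit set, while OR-ing CAP_LOOKUP with CAP_PDKILL sets two, which is how a mixed-element argument can be detected.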
|
254996 |
28-Aug-2013 |
mckusick |
In looking at block layouts as part of fixing filesystem block allocations under low free-space conditions (-r254995), it was determined that the old block-preference search order used before -r249782 worked a bit better. This change reverts to that block-preference search order.
MFC after: 2 weeks
|
254995 |
28-Aug-2013 |
mckusick |
A performance problem was reported in PR kern/181226:
I have 25TB Dell PERC 6 RAID5 array. When it becomes almost full (10-20GB free), processes which write data to it start eating 100% CPU and write speed drops below 1MB/sec (normally it gives 400MB/sec). The revision at which it first became apparent was http://svnweb.freebsd.org/changeset/base/249782.
The offending change reserved an area in each cylinder group to store metadata. The new algorithm attempts to save this area for metadata and allows its use for non-metadata only after all the data areas have been exhausted. The size of the reserved area defaults to half of minfree, so the filesystem reports full before the data area can completely fill. However, in this report, the filesystem has had minfree reduced to 1% thus forcing the metadata area to be used for data. As the filesystem approached full, it had only metadata areas left to allocate. The result was that every block allocation had to scan summary data for 30,000 cylinder groups before falling back to searching up to 30,000 metadata areas.
The fix is to give up on saving the metadata areas once the free space reserve drops below 2%. The effect of this change is to use the old algorithm of just accepting the first available block that we find. Since most filesystems use the default 5% minfree, this will have no effect on their operation. For those that want to push to the limit, they will get their crappy block placements quickly.
Submitted by: Dmitry Sivachenko Fix Tested by: Dmitry Sivachenko PR: kern/181226 MFC after: 2 weeks
|
253974 |
05-Aug-2013 |
mckusick |
With the addition of journalled soft updates, the "newblk" structures persist much longer than previously. Historically we had at most 100 entries; now the count may reach a million. With the increased count we spent far too much time looking them up in the grossly undersized newblk hash table. Configure the newblk hash table to accurately reflect the number of entries that it must index.
Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
|
253973 |
05-Aug-2013 |
mckusick |
To better understand performance problems with journalled soft updates, we need to collect the highest level of allocation for each of the different soft update dependency structures. This change collects these statistics and makes them available using `sysctl debug.softdep.highuse'.
Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
|
253341 |
14-Jul-2013 |
mckusick |
Update to comments describing block allocation policy.
Submitted by: Bruce Evans
|
253280 |
12-Jul-2013 |
kib |
Only copy as many bytes as there are in the superblock, instead of doing a full block copy, when copying the superblock into the snapshot. UFS1 does not align the superblock on the block boundary, and bcopy runs off the end of the buffer.
Reported by: Andre Albsmeier <Andre.Albsmeier@siemens.com> Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week
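The overrun described in this entry comes from copying a fixed block-sized region when the superblock does not start on a block boundary. A minimal userspace sketch of the bounded copy follows; copy_superblock() and all of its parameters are hypothetical and only model the idea of copying exactly the superblock's size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model: the superblock starts sb_off bytes into a source
 * buffer of src_len bytes.  Copy only sb_size bytes, never a whole
 * block, and refuse to read or write past either buffer.
 */
static void
copy_superblock(void *dst, size_t dst_len, const void *src_buf,
    size_t src_len, size_t sb_off, size_t sb_size)
{
	assert(sb_off <= src_len && sb_size <= src_len - sb_off);
	assert(sb_size <= dst_len);
	memcpy(dst, (const char *)src_buf + sb_off, sb_size);
}
```

Copying a full block starting at sb_off would read beyond src_len whenever sb_off is nonzero, which is the failure mode the commit message describes for UFS1.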
|
253106 |
09-Jul-2013 |
kib |
There are several code sequences like vfs_busy(mp); vfs_write_suspend(mp); which are problematic if another thread starts unmount between the two calls. The unmount starts a write, while vfs_write_suspend() drains writers. On the other hand, unmount drains busy references, causing the deadlock.
Add a flag argument to vfs_write_suspend and require the callers of it to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the mount path, i.e. the covered vnode is not locked. The suspension is not attempted if VS_SKIP_UNMOUNT is specified and unmount is in progress.
Reported and tested by: Andreas Longwitz <longwitz@incore.de> Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
|
252527 |
02-Jul-2013 |
mckusick |
Make better use of metadata area by avoiding using it for data blocks that should no longer immediately follow their indirect blocks.
MFC after: 2 weeks
|
252515 |
02-Jul-2013 |
pfg |
Style fix: spaces.
Cleanup the incomplete revert.
Reported by: bde MFC after: 4 weeks
|
252484 |
01-Jul-2013 |
pfg |
Change i_gen in UFS to an unsigned type.
Revert the simplification of the i_gen calculation. It is still a good idea to avoid zero values and for the case of old filesystems there is probably no advantage in using the complete 32 bits anyways.
Discussed with: bde MFC after: 4 weeks
|
252467 |
01-Jul-2013 |
pfg |
Change i_gen in UFS to an unsigned type.
Further simplify the i_gen calculation for older disks. Having a zero here is not really a problem and this is more similar to what is done in newfs_random().
Reported by: Xin Li MFC after: 4 weeks
|
252435 |
01-Jul-2013 |
pfg |
Change i_gen in UFS to an unsigned type.
In UFS, i_gen is a randomly generated value and there is no way for it to be negative. Actually, the value of i_gen is just used to match bit patterns and it is of no consequence whether the values are signed or not.
Following other filesystems, set it to unsigned and use it as such.
Discussed by: mckusick Reviewed by: mckusick (previous version) MFC after: 4 weeks
|
251171 |
31-May-2013 |
jeff |
- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.
Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf
|
250901 |
22-May-2013 |
mckusick |
Properly spell sentinel (missed in 250891) No functional changes.
Spotted by: Navdeep Parhar and Alexey Dokuchaev MFC after: 2 weeks
|
250897 |
22-May-2013 |
mckusick |
Add missing buffer releases (brelse) after bread calls that return an error. One could argue that returning a buffer even when it is not valid is incorrect, but bread has always returned a buffer valid or not.
Reviewed by: kib MFC after: 2 weeks
|
250895 |
22-May-2013 |
mckusick |
Add missing 28th element to softdep types name array.
Found by: Coverity Scan, CID 1007621 Reviewed by: kib MFC after: 2 weeks
|
250894 |
22-May-2013 |
mckusick |
Null a pointer after it is freed so that when it is returned the return value is NULL. Based on the returned flags, the return value should never be inspected in the case where NULL is returned, but it is good coding practice not to return a pointer to freed memory.
Found by: Coverity Scan, CID 1006096 Reviewed by: kib MFC after: 2 weeks
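The practice described in this entry, clearing a pointer once the memory behind it has been freed so that the function returns NULL instead of a dangling pointer, looks roughly like this in userspace. struct item and item_release() are hypothetical and unrelated to the actual soft updates structures.

```c
#include <assert.h>
#include <stdlib.h>

struct item {
	int	value;
};

/*
 * Hypothetical example: free an item whose value is zero and report
 * that through *freed.  The returned pointer is NULL whenever the
 * item was freed, so callers can never see freed memory.
 */
static struct item *
item_release(struct item *it, int *freed)
{
	assert(freed != NULL);
	*freed = 0;
	if (it != NULL && it->value == 0) {
		free(it);
		it = NULL;	/* do not hand back a freed pointer */
		*freed = 1;
	}
	return (it);
}
```

Even if callers are expected to consult only the flag, returning NULL makes an accidental dereference fail immediately rather than silently touch freed memory.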
|
250892 |
22-May-2013 |
mckusick |
Remove a bogus check for a NULL buffer pointer. Add a KASSERT that it is not NULL.
Found by: Coverity Scan, CID 1009114 Reviewed by: kib MFC after: 2 weeks
|
250891 |
22-May-2013 |
mckusick |
Properly spell sentinel (not sintenel or sentinal). No functional changes.
Spotted by: kib MFC after: 2 weeks
|
250576 |
12-May-2013 |
eadler |
Fix several typos
PR: kern/176054 Submitted by: Christoph Mallon <christoph.mallon@gmx.de> MFC after: 3 days
|
249582 |
17-Apr-2013 |
gabor |
- Correct misspellings of the word occurrence
Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
|
249218 |
06-Apr-2013 |
jeff |
Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases.
Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division
|
249064 |
03-Apr-2013 |
mckusick |
The code in clear_remove() and clear_inodedeps() skips one entry in the pagedep and inodedep hash tables. An entry in the table is skipped because 'pagedep_hash' and 'inodedep_hash' hold the size of the hash tables - 1.
An operational failure from this is extremely unlikely. These functions only need to find a single entry and are only called when there are too many entries. The chance that they would fail because all the entries are on the single skipped hash chain is remote.
Submitted by: Pedro Martelletto Reviewed by: kib MFC after: 2 weeks
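The off-by-one described in this entry arises because the stored value is the table size minus one, so a loop bounded by "less than" that value visits one chain too few. The sketch below models a rotating scan with the inclusive bound; find_nonempty_chain(), chain_len, and mask are hypothetical names, with mask playing the role of pagedep_hash or inodedep_hash.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a rotating scan over hash chains.  'mask' is
 * the table size minus one, so valid slots are 0..mask inclusive.
 * Returns the first non-empty slot found, or -1 if all are empty.
 */
static long
find_nonempty_chain(const int *chain_len, size_t mask, size_t *next)
{
	size_t cnt, slot;

	assert(*next <= mask);
	/* '<=' visits all mask + 1 chains; '<' would skip one of them. */
	for (cnt = 0; cnt <= mask; cnt++) {
		slot = *next;
		if (++*next > mask)
			*next = 0;
		if (chain_len[slot] != 0)
			return ((long)slot);
	}
	return (-1);
}
```

With an eight-entry table whose only non-empty chain is the last one, the inclusive bound is what allows the scan to reach it.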
|
248623 |
22-Mar-2013 |
mckusick |
The purpose of this change to the FFS layout policy is to reduce the running time for a full fsck. It also reduces the random access time for large files and speeds the traversal time for directory tree walks.
The key idea is to reserve a small area in each cylinder group immediately following the inode blocks for the use of metadata, specifically indirect blocks and directory contents. The new policy is to preferentially place metadata in the metadata area and everything else in the blocks that follow the metadata area.
The size of this area can be set when creating a filesystem using newfs(8) or changed in an existing filesystem using tunefs(8). Both utilities use the `-k held-for-metadata-blocks' option to specify the amount of space to be held for metadata blocks in each cylinder group. By default, newfs(8) sets this area to half of minfree (typically 4% of the data area).
This work was inspired by a paper presented at Usenix's FAST '13: www.usenix.org/conference/fast13/ffsck-fast-file-system-checker
Details of this implementation appear in the April 2013 issue of ;login: www.usenix.org/publications/login/april-2013-volume-38-number-2. A copy of the April 2013 ;login: paper can also be downloaded from: www.mckusick.com/publications/faster_fsck.pdf.
Reviewed by: kib Tested by: Peter Holm MFC after: 4 weeks
|
248521 |
19-Mar-2013 |
kib |
UFS support of the unmapped i/o for the user data buffers.
Sponsored by: The FreeBSD Foundation Tested by: pho, scottl, jhb, bf
|
248515 |
19-Mar-2013 |
kib |
Do not remap usermode pages into KVA for physio.
Sponsored by: The FreeBSD Foundation Tested by: pho
|
248283 |
14-Mar-2013 |
kib |
Some style fixes.
Sponsored by: The FreeBSD Foundation
|
248282 |
14-Mar-2013 |
kib |
Add currently unused flag argument to the cluster_read(), cluster_write() and cluster_wbuild() functions. The flags to be allowed are a subset of the GB_* flags for getblk().
Sponsored by: The FreeBSD Foundation Tested by: pho
|
248084 |
09-Mar-2013 |
attilio |
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes.
The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs.
The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
|
247388 |
27-Feb-2013 |
kib |
The softdep freeblks workitem might hold a reference on the dquot. The current dqflush() panics when a dquot with a non-zero refcount is encountered. The situation is possible because quotas are turned off before the softdep workitem queue is flushed, since the quota file writes might create softdep workitems.
Make encountering an active dquot in dqflush() non-fatal, and return the error from quotaoff() instead. Ignore the quotaoff() failures when ffs_flushfiles() is called in the course of the softdep_flushfiles() loop, until the last iteration. At the last iteration, the quotas must be closed, and because SU workitems should already be flushed, the references to the dquot are gone.
Sponsored by: The FreeBSD Foundation Reported and tested by: pho Reviewed by: mckusick MFC after: 2 weeks
|
247387 |
27-Feb-2013 |
kib |
An inode block must not be blockingly read while cg block is owned. The order is inode buffer lock -> snaplk -> cg buffer lock, reversing the order causes deadlocks.
Inode block must not be written while cg block buffer is owned. The FFS copy on write needs to allocate a block to copy the content of the inode block, and the cylinder group selected for the allocation might be the same as the owned cg block. The reserved block detection code in the ffs_copyonwrite() and ffs_bp_snapblk() is unable to detect the situation, because the locked cg buffer is not exposed to it.
In order to maintain the dependency between initialized inode block and the cg_initediblk pointer, look up the inode buffer in non-blocking mode. If succeeded, brelse cg block, initialize the inode block and write it. After the write is finished, reread cg block and update the cg_initediblk.
If the inode block is already locked by another thread, let the other thread initialize it. If another thread raced with us after we started writing the inode block, the situation is detected by an update of cg_initediblk. Note that double-initialization of the inode block is harmless, since the block cannot be used until cg_initediblk is incremented.
Sponsored by: The FreeBSD Foundation In collaboration with: pho Reviewed by: mckusick MFC after: 1 month X-MFC-note: after r246877
|
246877 |
16-Feb-2013 |
mckusick |
The UFS2 filesystem allocates new blocks of inodes as they are needed. When a cylinder group runs short of inodes, a new block for inodes is allocated, zero'ed, and written to the disk. The zero'ed inodes must be on the disk before the cylinder group can be updated to claim them. If the cylinder group claiming the new inodes were written before the zero'ed block of inodes, the system could crash with the filesystem in an unrecoverable state.
Rather than adding a soft updates dependency to ensure that the new inode block is written before it is claimed by the cylinder group map, we just do a barrier write of the zero'ed inode block to ensure that it will get written before the updated cylinder group map can be written. This change should only slow down bulk loading of newly created filesystems since that is the primary time that new inode blocks need to be created.
Reported by: Robert Watson Reviewed by: kib Tested by: Peter Holm
|
246612 |
10-Feb-2013 |
kib |
Fix several unsafe pointer dereferences in the buffered_write() function, implementing the sysctl vfs.ffs.set_bufoutput (not used in the tree yet).
- The current directory vnode dereference is unsafe since fd_cdir could be changed and unreferenced, lock the filedesc around and vref the fd_cdir.
- The VTOI() conversion of the fd_cdir is unsafe without first checking that the vnode is indeed from an FFS mount, otherwise the code dereferences a random memory.
- The cdir could be reclaimed from under us, lock it around the checks.
- The type of the fp vnode might be not a disk, or it might have changed while the thread was in flight, check the type.
Reviewed and tested by: mckusick MFC after: 2 weeks
|
246289 |
03-Feb-2013 |
mckusick |
For UFS2 i_blocks is unsigned. The current "sanity" check that it has gone below zero after the blocks in its inode are freed is a no-op which the compiler fails to warn about because of the use of the DIP macro. Change the sanity check to compare the number of blocks being freed against the value i_blocks. If the number of blocks being freed exceeds i_blocks, just set i_blocks to zero.
Reported by: Pedro Giffuni (pfg@) MFC after: 2 weeks
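The corrected check in r246289 can be illustrated with a minimal userland sketch. It is not the kernel code: `blocks_after_free()` and its parameter names are hypothetical, and the DIP macro is left out entirely. The point is only that a "less than zero" test on an unsigned counter can never fire, so the comparison has to happen before the subtraction.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's code: compute the new block
 * count after releasing blocks_released blocks from an inode whose
 * count is unsigned, as i_blocks is for UFS2.
 */
static uint64_t
blocks_after_free(uint64_t i_blocks, uint64_t blocks_released)
{
	/*
	 * Testing "i_blocks - blocks_released < 0" would always be false
	 * for unsigned operands, so compare before subtracting.
	 */
	if (blocks_released >= i_blocks)
		return (0);
	return (i_blocks - blocks_released);
}
```

Calling `blocks_after_free(3, 10)` yields zero instead of wrapping around to a huge value.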
|
245286 |
11-Jan-2013 |
kib |
Add flags argument to vfs_write_resume() and remove vfs_write_resume_flags().
Sponsored by: The FreeBSD Foundation
|
244925 |
01-Jan-2013 |
kib |
The process_deferred_inactive() function locks the vnodes of the ufs mount, which means that it must not be called while the snaplock is owned. The vfs_write_resume(9) does call the function as the VFS_SUSP_CLEAN() method, which is too early and falls into the region still protected by snaplock.
Add yet another flag for the vfs_write_resume_flags() to avoid calling suspension cleanup handler after the suspend is lifted, and use it in the ffs_snapshot() call to vfs_write_resume.
Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
|
244795 |
28-Dec-2012 |
kib |
Make it possible to atomically resume writes on the mount and account the write start, by adding a variation of the vfs_write_resume(9) which accepts flags.
Use the new function to prevent a deadlock between parallel suspension and snapshotting a UFS mount. The ffs_snapshot() code performed vfs_write_resume() followed by vn_start_write() while owning the snaplock. If the suspension intervened between resume and vn_start_write(), the deadlock occurred after the suspending thread tried to lock the snaplock, most typically during the write in ffs_copyonwrite().
Reported and tested by: Andreas Longwitz <longwitz@incore.de> Reviewed by: mckusick MFC after: 2 weeks X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC, in HEAD
|
244534 |
21-Dec-2012 |
attilio |
Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There is no evidence that consumers could cope with such a change in semantics, and this situation could eventually lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved.
Tested by: pho Reviewed by: kib, mdf MFC after: 1 week
|
243311 |
19-Nov-2012 |
attilio |
r16312 has not been valid for many years (likely since VFS received granular locking), but the comment present in UFS has been incorrectly copied into other filesystems' code several times.
Remove comments that make no sense now.
Reviewed by: kib MFC after: 3 days
|
243250 |
18-Nov-2012 |
trasz |
Fix build of kdump(1).
|
243245 |
18-Nov-2012 |
trasz |
Add UFS writesuspension mechanism, designed to allow userland processes to modify on-disk metadata for filesystems mounted for write.
Reviewed by: kib, mckusick Sponsored by: FreeBSD Foundation
|
243018 |
14-Nov-2012 |
jeff |
- Fix a truncation bug with softdep journaling that could leak blocks on crash. When truncating a file that never made it to disk we use the canceled allocation dependencies to hold the journal records until the truncation completes. Previously allocdirect dependencies on the id_bufwait list were not considered and their journal space could expire before the bitmaps were written. Cancel them and attach them to the freeblks as we do for other allocdirects.
- Add KTR traces that were used to debug this problem.
- When adding jsegdeps, always use jwork_insert() so we don't have more than one segdep on a given jwork list.
Sponsored by: EMC / Isilon Storage Division
|
242924 |
12-Nov-2012 |
jeff |
- Fix a bug that has existed since the original softdep implementation. When a background copy of a cg is written we complete any work associated with that bmsafemap. If new work has been added to the non-background copy of the buffer it will be completed before the next write happens. The solution is to do the rollbacks when we make the copy so only those dependencies that were present at the time of writing will be completed when the background write completes. This would've resulted in various bitmap related corruptions and panics. It also would've expired journal entries early causing journal replay to miss some records.
MFC after: 2 weeks
|
242833 |
09-Nov-2012 |
attilio |
Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.
|
242815 |
09-Nov-2012 |
jeff |
- Correct rev 242734, segments can sometimes get stuck. Be a bit more defensive with segment state.
Reported by: b. f. <bf1783@googlemail.com>
|
242734 |
08-Nov-2012 |
jeff |
- Implement BIO_FLUSH support around journal entries. This will not 100% solve power loss problems with dishonest write caches. However, it should improve the situation and force a full fsck when it is unable to resolve with the journal.
- Resolve a case where the journal could wrap in an unsafe way causing us to prematurely lose journal entries in very specific scenarios.
Discussed with: mckusick MFC after: 1 month
|
242520 |
03-Nov-2012 |
mckusick |
When a file is first being written, the dynamic block reallocation (implemented by ffs_reallocblks_ufs[12]) relocates the file's blocks so as to cluster them together into a contiguous set of blocks on the disk.
When the cluster crosses the boundary into the first indirect block, the first indirect block is initially allocated in a position immediately following the last direct block. Block reallocation would usually destroy locality by moving the indirect block out of the way to keep the data blocks contiguous. This change compensates for this problem by noting that the first indirect block should be left immediately following the last direct block. It then tries to start a new cluster of contiguous blocks (referenced by the indirect block) immediately following the indirect block.
We should also do this for other indirect block boundaries, but it is only important for the first one.
Suggested by: Bruce Evans MFC: 2 weeks
|
242492 |
02-Nov-2012 |
jeff |
- In cancel_mkdir_dotdot don't panic if the inodedep is not available. If the previous diradd had already finished it could have been reclaimed already. This would only happen under heavy dependency pressure.
Reported by: Andrey Zonov <zont@FreeBSD.org> Discussed with: mckusick MFC after: 1 week
|
242379 |
30-Oct-2012 |
trasz |
Fix problem with geom_label(4) not recognizing UFS labels on filesystems extended using growfs(8). The problem here is that geom_label checks if the filesystem size recorded in UFS superblock is equal to the provider (i.e. device) size. This check cannot be removed due to backward compatibility. On the other hand, in most cases growfs(8) cannot set fs_size in the superblock to match the provider size, because, differently from newfs(8), it cannot recompute cylinder group sizes.
To fix this problem, add another superblock field, fs_providersize, used only for this purpose. The geom_label(4) will attach if either fs_size (filesystem created with newfs(8)) or fs_providersize (filesystem expanded using growfs(8)) matches the device size.
PR: kern/165962 Reviewed by: mckusick Sponsored by: FreeBSD Foundation
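The attach rule described in r242379 reduces to a two-way comparison. The following is a simplified sketch, not the geom_label(4) source: `struct sb_sizes` and `ufs_label_matches()` are hypothetical names, and the sketch assumes both superblock fields and the provider size are already expressed in the same unit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the two superblock fields. */
struct sb_sizes {
	int64_t fs_size;		/* set by newfs(8) */
	int64_t fs_providersize;	/* set by growfs(8) */
};

/*
 * Attach if either recorded size matches the provider (device) size.
 * Assumption of this sketch: all three values use the same unit.
 */
static bool
ufs_label_matches(const struct sb_sizes *sb, int64_t provider_size)
{
	return (sb->fs_size == provider_size ||
	    sb->fs_providersize == provider_size);
}
```

Keeping the original fs_size comparison alongside the new field is what preserves backward compatibility for filesystems that were never grown.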
|
242259 |
28-Oct-2012 |
trasz |
Fix two problems that caused instant panic when the device mounted with softupdates went away. Note that this does not fix the problem entirely; I'm committing it now to make it easier for someone to pick up the work.
Reviewed by: mckusick
|
241896 |
22-Oct-2012 |
kib |
Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes.
Conducted and reviewed by: attilio Tested by: pho
|
241011 |
27-Sep-2012 |
mdf |
Fix up kernel sources to be ready for a 64-bit ino_t.
Original code by: Gleb Kurtsou
|
239065 |
05-Aug-2012 |
kib |
After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason to pull in vm_param.h was removed. The other big dependency of vm_page.h on vm_param.h is the PA_LOCK* definitions, which are only needed for in-kernel code, because modules use KBI-safe functions to lock the pages.
Stop including vm_param.h into vm_page.h. Include vm_param.h explicitly for the kernel code which needs it.
Suggested and reviewed by: alc MFC after: 2 weeks
|
238697 |
22-Jul-2012 |
kevlo |
Use NULL instead of 0 for pointers
|
238029 |
02-Jul-2012 |
kib |
Extend the KPI to lock and unlock f_offset member of struct file. It now fully encapsulates all accesses to f_offset, and extends f_offset locking to other consumers that need it, in particular, to lseek() and variants of getdirentries().
Ensure that on 32-bit architectures f_offset, which is a 64-bit quantity, is always read and written under the mtxpool protection. This fixes an apparently easy-to-trigger race where parallel lseek()s, or lseek() and read/write, could destroy the file offset.
The already broken ABI emulations, including iBCS and SysV, are not converted (yet).
Tested by: pho No objections from: jhb MFC after: 3 weeks
|
237366 |
21-Jun-2012 |
kib |
Fix unbounded-length malloc, controlled from usermode. The added check is performed before the exact size of the buffer is calculated, but the buffer cannot have a size greater than the total space allocated for extended attributes. The existing check is executed with the precise size, but it is too late, since the buffer needs to be allocated in advance.
Also, adapt to uio_resid being of ssize_t type. Use lblktosize instead of multiplying by fs block size by hand as well.
Reported and tested by: pho MFC after: 1 week
|
236937 |
11-Jun-2012 |
mckusick |
In softdep_setup_inomapdep() we may have to allocate both inodedep and bmsafemap dependency structures in inodedep_lookup() and bmsafemap_lookup() respectively. The setup of these structures must be done while holding the soft-dependency mutex. If the inodedep is allocated first, it may be freed in the I/O completion callback when the mutex is released to allocate the bmsafemap. If the bmsafemap is allocated first, it may be freed in the I/O completion callback when the mutex is released to allocate the inodedep.
To resolve this problem, bmsafemap_lookup has had a parameter added that allows a pre-malloc'ed bmsafemap to be passed in so that it does not need to release the mutex to create a new bmsafemap. The softdep_setup_inomapdep() routine pre-malloc's a bmsafemap dependency before acquiring the mutex and starting to build the inodedep with a call to inodedep_lookup(). The subsequent call to bmsafemap_lookup() is passed this pre-allocated bmsafemap entry so that it need not release the mutex if it needs to create a new one.
Reported by: Peter Holm Tested by: Peter Holm MFC after: 1 week
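The pre-allocation pattern described in r236937 can be sketched in userland with a pthread mutex standing in for the soft-dependency mutex. Everything here is illustrative: `struct bmsafemap_sketch`, `map_lookup()`, and `setup_mapdep()` are hypothetical names, and the real dependency structures carry far more state. The sketch only shows that the allocation happens before the lock is taken, so the lookup never has to drop the lock.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, minimal stand-in for a bmsafemap dependency. */
struct bmsafemap_sketch {
	int cg;
	struct bmsafemap_sketch *next;
};

static pthread_mutex_t dep_lock = PTHREAD_MUTEX_INITIALIZER;
static struct bmsafemap_sketch *maps;	/* protected by dep_lock */

/*
 * Caller holds dep_lock.  Return the existing entry for cg, or install
 * the caller-supplied spare.  The lock is never released in here.
 */
static struct bmsafemap_sketch *
map_lookup(int cg, struct bmsafemap_sketch *spare, int *used)
{
	struct bmsafemap_sketch *m;

	*used = 0;
	for (m = maps; m != NULL; m = m->next)
		if (m->cg == cg)
			return (m);
	spare->cg = cg;
	spare->next = maps;
	maps = spare;
	*used = 1;
	return (spare);
}

static struct bmsafemap_sketch *
setup_mapdep(int cg)
{
	struct bmsafemap_sketch *spare, *m;
	int used;

	/* Allocate first: allocation may block, so do it unlocked. */
	spare = malloc(sizeof(*spare));
	if (spare == NULL)
		return (NULL);
	pthread_mutex_lock(&dep_lock);
	m = map_lookup(cg, spare, &used);
	pthread_mutex_unlock(&dep_lock);
	if (!used)
		free(spare);	/* an entry already existed */
	return (m);
}
```

The cost of this design is an occasional wasted allocation when the entry already exists, which is traded for never releasing the lock between the two lookups.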
|
236322 |
30-May-2012 |
kib |
Enable vn_io_fault() lock avoidance for UFS.
Tested by: pho MFC after: 2 months
|
235610 |
18-May-2012 |
mckusick |
Add missing `continue' statement at end of case.
Found by: Kevin Lo (kevlo@) MFC after: 1 week
|
234608 |
23-Apr-2012 |
trasz |
Remove unused thread argument from clear_inodeps() and clear_remove().
|
234605 |
23-Apr-2012 |
trasz |
Remove unused thread argument from vtruncbuf().
Reviewed by: kib
|
234537 |
21-Apr-2012 |
trasz |
Fix use-after-free introduced in r234036.
Reviewed by: mckusick Tested by: pho
|
234483 |
20-Apr-2012 |
mckusick |
This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point to replace MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync routines.
The vfs_msync routine is run every 30 seconds for every writably mounted filesystem. It ensures that any files mmap'ed from the filesystem with modified pages have those pages queued to be written back to the file from which they are mapped.
The ffs_sync_lazy and qsync routines are run every 30 seconds for every writably mounted UFS/FFS filesystem. The ffs_sync_lazy routine ensures that any files that have been accessed in the previous 30 seconds have had their access times queued for updating in the filesystem. The qsync routine ensures that any files with modified quotas have those quotas queued to be written back to their associated quota file.
In a system configured with 250,000 vnodes, less than 1000 are typically active at any point in time. Prior to this change all 250,000 vnodes would be locked and inspected twice every minute by the syncer. For UFS/FFS filesystems they would be locked and inspected six times every minute (twice by each of these three routines since each of these routines does its own pass over the vnodes associated with a mount point). With this change the syncer now locks and inspects only the tiny set of vnodes that are active.
Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
|
234386 |
17-Apr-2012 |
mckusick |
Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL. The primary changes are that the user of the interface no longer needs to manage the mount-mutex locking and that the vnode that is returned has its mutex locked (thus avoiding the need to check to see if it is DOOMED or other possible end of life scenarios).
To minimize compatibility issues for third-party developers, the old MNT_VNODE_FOREACH interface will remain available so that this change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH will be removed in head.
The reason for this update is to prepare for the addition of the MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point).
Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
|
234158 |
11-Apr-2012 |
mckusick |
Export vinactive() from kern/vfs_subr.c (e.g., make it no longer static and declare its prototype in sys/vnode.h) so that it can be called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c) instead of the body of vinactive() being cut and pasted into process_deferred_inactive().
Reviewed by: kib MFC after: 2 weeks
|
234036 |
08-Apr-2012 |
trasz |
Fix panic in ffs_reload(), which may happen when read-only filesystem gets resized and then reloaded.
Reviewed by: kib, mckusick (earlier version) Sponsored by: The FreeBSD Foundation
|
234024 |
08-Apr-2012 |
mckusick |
Drop an unnecessary setting of si_mountpt when updating a UFS mount point. Clearly it must have been set when the mount was done.
Reviewed by: kib
|
233817 |
02-Apr-2012 |
mckusick |
A file cannot be deallocated until its last name has been removed and it is no longer referenced by a user process. The inode for a file whose name has been removed, but is still referenced at the time of a crash will still be allocated in the filesystem, but will have no references (e.g., they will have no names referencing them from any directory).
With traditional soft updates these unreferenced inodes will be found and reclaimed when the background fsck is run. When using journaled soft updates, the kernel must keep track of these inodes so that it can find and reclaim them during the cleanup process. Their existence cannot be stored in the journal as the journal only handles short-term events, and they may persist for days. So, they are tracked by keeping them in a linked list whose head pointer is stored in the superblock. The journal tracks them only until their linked list pointers have been committed to disk. Part of the cleanup process involves traversing the list of unreferenced inodes and reclaiming them.
This bug was triggered when confusion arose in the commit steps of keeping the unreferenced-inode linked list coherent on disk. Notably, a race between the link() system call adding a link-count to a file and the unlink() system call removing a link-count to the file. Here if the unlink() ran after link() had looked up the file but before link() had incremented the link-count of the file, the file's link-count would drop to zero before the link() incremented it back up to one. If the file was referenced by a user process, the first transition through zero made it appear that it should be added to the unreferenced-inode list when in fact it should not have been added. If the new name created by link() was deleted within a few seconds (with the file still referenced by a user process) it would legitimately be a candidate for addition to the unreferenced-inode list. The result was that there were two attempts to add the same inode to the unreferenced-inode list which scrambled the unreferenced-inode list's pointers leading to a panic. The fix is to detect and avoid the false attempt at adding it to the unreferenced-inode list by having the link() system call check to see if the link count is zero before it increments it. If it is, the link() fails with ENOENT (showing that it has failed the link()/unlink() race).
While tracking down this bug, we have added additional assertions to detect the problem sooner and also simplified some of the code.
Reported by: Kirk Russell Fix submitted by: Jeff Roberson Tested by: Peter Holm PR: kern/159971 MFC (to 9 only): 2 weeks
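The fix for the link()/unlink() race in r233817 comes down to one guard before the increment. The sketch below is a reduced illustration, not the UFS code: `struct inode_sketch` and `link_add_name()` are hypothetical, and all locking and journaling is omitted.

```c
#include <errno.h>

/* Hypothetical, reduced inode: only the link count matters here. */
struct inode_sketch {
	int nlink;
};

/*
 * Add a name to an inode that was looked up earlier.  If unlink() has
 * already dropped the count to zero, refuse instead of reviving the
 * inode, so it is not queued on the unreferenced-inode list twice.
 */
static int
link_add_name(struct inode_sketch *ip)
{
	if (ip->nlink == 0)
		return (ENOENT);	/* lost the race with unlink() */
	ip->nlink++;
	return (0);
}
```

Failing with ENOENT matches what a caller would have seen had unlink() completed before the lookup.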
|
233629 |
28-Mar-2012 |
mckusick |
A refinement of change 232351 to avoid a race with a forcible unmount. While we have a snapshot vnode unlocked to avoid a deadlock with another inode in the same inode block being updated, the filesystem containing it may be forcibly unmounted. When that happens the snapshot vnode is revoked. We need to check for that condition and fail appropriately.
This change will be included along with 232351 when it is MFC'ed to 9.
Spotted by: kib Reviewed by: kib
|
233627 |
28-Mar-2012 |
mckusick |
Keep track of the mount point associated with a special device to enable the collection of counts of synchronous and asynchronous reads and writes for its associated filesystem. The counts are displayed using `mount -v'.
Ensure that buffers used for paging indicate the vnode from which they are operating so that counts of paging I/O operations from the filesystem are collected.
This checkin only adds the setting of the mount point for the UFS/FFS filesystem, but it would be trivial to add the setting and clearing of the mount point at filesystem mount/unmount time for other filesystems too.
Reviewed by: kib
|
233610 |
28-Mar-2012 |
kib |
Do trivial reformatting of the comment to record the missed commit message for r233609: Restore the writes of atimes, quotas and superblock from syncer vnode.
Noted by: rdivacky
|
233609 |
28-Mar-2012 |
kib |
Reviewed by: bde, mckusick Tested by: pho MFC after: 2 weeks
|
233607 |
28-Mar-2012 |
kib |
Update comment.
MFC after: 3 days
|
233438 |
25-Mar-2012 |
mckusick |
Add a third flags argument to ffs_syncvnode to avoid a possible conflict with MNT_WAIT flags that are passed in its second argument. This will be MFC'ed together with r232351.
Discussed with: kib
|
232948 |
13-Mar-2012 |
kib |
Supply boolean as the second argument to ffs_update(), and not a MNT_[NO]WAIT constants, which in fact always caused sync operation.
Based on the submission by: bde Reviewed by: mckusick MFC after: 2 weeks
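Why the MNT_[NO]WAIT constants "always caused sync operation" is easy to see once they are treated as booleans. In the sketch below the three constants are redefined locally with the values they have in FreeBSD's sys/mount.h so that it builds standalone; `update_is_synchronous()` is a hypothetical stand-in for a function whose second argument is a plain "wait for the write" boolean.

```c
#include <stdbool.h>

/*
 * Local copies for illustration; the values are taken to match
 * FreeBSD's sys/mount.h.  None of them is zero.
 */
#define	SKETCH_MNT_WAIT		1
#define	SKETCH_MNT_NOWAIT	2
#define	SKETCH_MNT_LAZY		3

/* Hypothetical callee that interprets its argument as a boolean. */
static bool
update_is_synchronous(int waitfor)
{
	return (waitfor != 0);
}
```

Every one of the three constants evaluates as true, so a caller that wants an asynchronous update has to pass an explicit false (zero) instead.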
|
232837 |
11-Mar-2012 |
kib |
Remove superfluous brackets.
Submitted by: alc MFC after: 2 weeks
|
232836 |
11-Mar-2012 |
kib |
Do schedule delayed writes for async mounts. While there, make some style adjustments, like missed () around return values.
Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks
|
232835 |
11-Mar-2012 |
kib |
Do not fall back to slow synchronous i/o when low on memory or buffers. The bawrite() schedules the write to happen immediately, and its use frees the current thread to do more cleanups.
Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks
|
232834 |
11-Mar-2012 |
kib |
In ffs_syncvnode(), pass boolean false as second argument of ffs_update(). Synchronous inode block update is not needed for MNT_LAZY callers (syncer), and since waitfor values are not zero, code did unnecessary synchronous update.
Submitted by: bde Reviewed by: mckusick Tested by: pho MFC after: 2 weeks
|
232833 |
11-Mar-2012 |
kib |
Remove not needed ARGSUSED lint command.
Submitted by: bde MFC after: 3 days
|
232732 |
09-Mar-2012 |
pho |
Revert r232692 as the correct place to fix this is at the syscall level.
|
232709 |
09-Mar-2012 |
kib |
Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which allows a filesystem to request VFS to not allow MNTK_ASYNC.
MFC after: 1 week
|
232692 |
08-Mar-2012 |
pho |
syscall() fuzzing can trigger this panic. Return EINVAL instead.
MFC after: 1 week
|
232351 |
01-Mar-2012 |
mckusick |
This change avoids a kernel deadlock on "snaplk" when using snapshots on UFS filesystems running with journaled soft updates. This is the first of several bugs that need to be fixed before removing the restriction added in -r230250 to prevent the use of snapshots on filesystems running with journaled soft updates.
The deadlock occurs when holding the snapshot lock (snaplk) and then trying to flush an inode via ffs_update(). We become blocked by another process trying to flush a different inode contained in the same inode block that we need. It holds the inode block for which we are waiting locked. When it tries to write the inode block, it gets blocked waiting for our snaplk when it calls ffs_copyonwrite() to see if the inode block needs to be copied in our snapshot.
The most obvious place that this deadlock arises is in the ffs_copyonwrite() routine when it updates critical metadata in a snapshot and tries to write it out before proceeding. The fix here is to write the data and indirect block pointer for the snapshot, but to skip the call to ffs_update() to write the snapshot inode. To ensure that we will never have to update a pointer in the inode itself, the ffs_snapshot() routine that creates the snapshot has to ensure that all the direct blocks are allocated as part of the creation of the snapshot.
A less obvious place that this deadlock occurs is when we hold the snaplk because we are deleting a snapshot. In the course of doing the deletion, we need to allocate various soft update dependency structures and allocate some journal space. If we hit a resource limit while doing this we decrease the resources in use by flushing out an existing dirty file to get it to give up the soft dependency resources that it holds. The flush can cause an ffs_update() to be done on the inode for the file that we have selected to flush resulting in the same deadlock as described above when the inode that we have chosen to flush resides in the same inode block as the snapshot inode that we hold. The fix is to defer cleaning up any time that the inode on which we are operating is a snapshot.
Help and review by: Jeff Roberson Tested by: Peter Holm MFC (to 9 only) after: 2 weeks
|
231949 |
21-Feb-2012 |
kib |
Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows SSIZE_MAX-sized i/o requests to be performed from usermode.
Discussed with: bde, das (previous versions) MFC after: 1 month
|
231572 |
13-Feb-2012 |
mckusick |
Missing conditions in checking whether an inode has been written.
Found and tested by: Peter Holm MFC after: 2 weeks (to 9 only)
|
231313 |
09-Feb-2012 |
mckusick |
Historically when an application wrote an entire block of a file, the kernel allocated a buffer but did not zero it as it was about to be completely filled by a uiomove() from the user's buffer. However, if the uiomove() failed, the old contents of the buffer could be exposed especially if the file was being mmap'ed. The fix was to always zero the buffer when it was allocated.
This change first attempts the uiomove() to the newly allocated (and dirty) buffer and only zeros it if the uiomove() fails. The effect is to eliminate the gratuitous zeroing of the buffer in the usual case where the uiomove() successfully fills it.
Reviewed by: kib Tested by: scottl MFC after: 2 weeks (to 9 only)
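The copy-first, zero-on-failure ordering from r231313 can be sketched in userland as follows. This is not the FFS write path: `copy_in_sketch()` is a hypothetical stand-in for uiomove() that simply fails when given a NULL source, and `fill_block()` is an illustrative name.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for uiomove(): copy n bytes from the caller's
 * buffer, failing with EFAULT when the source is NULL.
 */
static int
copy_in_sketch(void *dst, const void *src, size_t n)
{
	if (src == NULL)
		return (EFAULT);
	memcpy(dst, src, n);
	return (0);
}

/*
 * Fill a freshly allocated block that may still hold stale contents.
 * The block is zeroed only when the copy fails, so the common
 * successful case pays for one pass over the buffer instead of two.
 */
static int
fill_block(char *buf, size_t blksize, const void *src)
{
	int error;

	error = copy_in_sketch(buf, src, blksize);
	if (error != 0)
		memset(buf, 0, blksize);	/* scrub stale data */
	return (error);
}
```

After a failed copy the buffer is zeroed, so the stale contents are never exposed; after a successful copy they have been completely overwritten.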
|
231160 |
07-Feb-2012 |
mckusick |
In the original days of BSD, a sync was issued on every filesystem every 30 seconds. This spike in I/O caused the system to pause every 30 seconds which was quite annoying. So, the way that sync worked was changed so that when a vnode was first dirtied, it was put on a 30-second cleaning queue (see the syncer_workitem_pending queues in kern/vfs_subr.c). If the file has not been written or deleted after 30 seconds, the syncer pushes it out. As the syncer runs once per second, dirty files are trickled out slowly over the 30-second period instead of all at once by a call to sync(2).
The one drawback to this is that it does not cover the filesystem metadata. To handle the metadata, vfs_allocate_syncvnode() is called to create a "filesystem syncer vnode" at mount time which cycles around the cleaning queue being sync'ed every 30 seconds. In the original design, the only things it would sync for UFS were the filesystem metadata: inode blocks, cylinder group bitmaps, and the superblock (e.g., by VOP_FSYNC'ing devvp, the device vnode from which the filesystem is mounted).
Somewhere in its path to integration with FreeBSD the flushing of the filesystem syncer vnode got changed to sync every vnode associated with the filesystem. The result of this change is to return to the old filesystem-wide flush every 30-seconds behavior and makes the whole 30-second delay per vnode useless.
This change goes back to the originally intended trickle out sync behavior. Key to ensuring that all the intended semantics are preserved (e.g., that all inode updates get flushed within a bounded period of time) is that all inode modifications get pushed to their corresponding inode blocks so that the metadata flush by the filesystem syncer vnode gets them to the disk in a timely way. Thanks to Konstantin Belousov (kib@) for doing the audit and commit -r231122 which ensures that all of these updates are being made.
Reviewed by: kib Tested by: scottl MFC after: 2 weeks
|
231091 |
06-Feb-2012 |
kib |
Add missing opt_quota.h include to activate #ifdef QUOTA blocks, apparently a step in unbreaking QUOTA support.
Reported and tested by: Adam Strohl <adams-freebsd ateamsystems com> MFC after: 1 week
|
231077 |
06-Feb-2012 |
kib |
JNEWBLK dependency may legitimately appear on the buf dependency list. If softdep_sync_buf() discovers such dependency, it should do nothing, which is safe as it is only waiting on the parent buffer to be written, so it can be removed.
Committed on behalf of: jeff MFC after: 1 week
|
230250 |
17-Jan-2012 |
mckusick |
There are several bugs/hangs when trying to take a snapshot on a UFS/FFS filesystem running with journaled soft updates. Until these problems have been tracked down, return ENOTSUPP when an attempt is made to take a snapshot on a filesystem running with journaled soft updates.
MFC after: 2 weeks
|
230249 |
17-Jan-2012 |
mckusick |
Make sure all intermediate variables holding mount flags (mnt_flag) and that all internal kernel calls passing mount flags are declared as uint64_t so that flags in the top 32-bits are not lost.
MFC after: 2 weeks
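The loss that r230249 guards against can be reproduced with a minimal sketch. The flag value and both function names are hypothetical; the only claim is that passing a 64-bit flag word through a 32-bit intermediate silently drops every flag above bit 31.

```c
#include <stdint.h>

/* Hypothetical flag located in the top 32 bits of the flag word. */
#define	SKETCH_FLAG_HIGH	(UINT64_C(1) << 40)

/* Intermediate variable is too narrow: the high flag is lost. */
static uint64_t
pass_through_u32(uint64_t flags)
{
	uint32_t narrow = (uint32_t)flags;

	return (narrow);
}

/* Intermediate variable is uint64_t: all flags survive. */
static uint64_t
pass_through_u64(uint64_t flags)
{
	uint64_t wide = flags;

	return (wide);
}
```

With `SKETCH_FLAG_HIGH | 1` as input, the narrow path returns only the low bit, while the wide path returns the value unchanged.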
|
230101 |
14-Jan-2012 |
mckusick |
Convert FFS mount error messages from kernel printf's to using the vfs_mount_error error message facility provided by the nmount interface.
Clean up formatting of mount warnings which still need to use kernel printf's since they do not return errors.
Requested by: Craig Rodrigues <rodrigc@crodrigues.org> MFC after: 2 weeks
|
229200 |
01-Jan-2012 |
ed |
Migrate ufs and ext2fs from skpc() to memcchr().
While there, remove a useless check from the code. memcchr() always returns characters unequal to 0xff in this case, so inosused[i] ^ 0xff can never be equal to zero. Also, the fact that memcchr() returns a pointer instead of the number of bytes until the end, makes conversion to an offset far more easy.
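As a rough illustration of why a pointer-returning search makes the offset computation easy, here is a minimal userland sketch. ex_memcchr() is a hypothetical stand-in written for this example and assumed to behave like the kernel routine (return the first byte that differs from c, or NULL); ex_first_nonfull_byte() and the sample map contents are likewise assumptions.

```c
#include <stddef.h>

/* Userland stand-in: first byte in buf that is not equal to c, or NULL. */
static const unsigned char *
ex_memcchr(const unsigned char *buf, int c, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (buf[i] != (unsigned char)c)
			return (&buf[i]);
	return (NULL);
}

/*
 * Index of the first map byte that still has a free (zero) bit,
 * or -1 if every byte is 0xff. The offset is a pointer subtraction.
 */
static long
ex_first_nonfull_byte(const unsigned char *inosused, size_t len)
{
	const unsigned char *p = ex_memcchr(inosused, 0xff, len);

	return (p == NULL ? -1 : (long)(p - inosused));
}
```

Because the returned byte is by construction not 0xff, a follow-up test of byte ^ 0xff against zero can never succeed, which matches the reasoning given for dropping the check.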
|
227382 |
09-Nov-2011 |
gleb |
Use implementation independent inoNN_t scalars for on-disk UFS structures
Approved by: mdf (mentor)
|
227309 |
07-Nov-2011 |
ed |
Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
|
225807 |
27-Sep-2011 |
mckusick |
This update eliminates a lock-order reversal warning discovered while tracking down the system hang reported in kern/160662 and corrected in revision 225806. The LOR is not the cause of the system hang and indeed cannot cause an actual deadlock. However, it can be easily eliminated by deferring the acquisition of a buflock until after all the vnode locks have been acquired.
Reported by: Hans Ottevanger PR: kern/160662
|
225806 |
27-Sep-2011 |
mckusick |
This update eliminates the system hang reported in kern/160662 when taking a snapshot on a filesystem running with journaled soft updates.
Reported by: Hans Ottevanger Fix verified by: Hans Ottevanger PR: kern/160662
|
225700 |
20-Sep-2011 |
kib |
Use nowait sync request for a vnode when doing softdep cleanup. We possibly own the unrelated vnode lock, doing waiting sync causes deadlocks.
Reported and tested by: pho Approved by: re (bz)
|
225166 |
25-Aug-2011 |
mm |
Generalize ffs_pages_remove() into vn_pages_remove().
Remove mapped pages for all dataset vnodes in zfs_rezget() using new vn_pages_remove() to fix mmapped files changed by zfs rollback or zfs receive -F.
PR: kern/160035, kern/156933 Reviewed by: kib, pjd Approved by: re (kib) MFC after: 1 week
|
224876 |
15-Aug-2011 |
rwatson |
Fix two cases involving opt_capsicum.h and module builds:
(1) opt_capsicum.h is no longer required in ffs_alloc.c, so remove the #include.
(2) portalfs depends on opt_capsicum.h, so have the Makefile generate one if required.
These affect only modules built without a kernel (i.e., not buildkernel, but yes buildworld if the dubious MODULES_WITH_WORLD is used).
Approved by: re (bz) Sponsored by: Google Inc
|
224778 |
11-Aug-2011 |
rwatson |
Second-to-last commit implementing Capsicum capabilities in the FreeBSD kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system call code to declare what capabilities are required when an integer file descriptor is converted into an in-kernel struct file *. With options CAPABILITIES compiled into the kernel, this enforces capability protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve information about the maximum rights at the time of mapping in the memory map so that they can later be enforced in mprotect(2) -- this is done by narrowing the rights in the existing max_protection field used for similar purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability mode, as we're not yet ready to enforce namespace capabilities there. This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they represent.
Approved by: re (bz) Submitted by: jonathan Sponsored by: Google Inc
|
224503 |
30-Jul-2011 |
mckusick |
Update to -r224294 to ensure that only one of MNT_SUJ or MNT_SOFTDEP is set so that mount can revert back to using MNT_NOWAIT when doing getmntinfo.
Approved by: re (kib)
|
224294 |
24-Jul-2011 |
mckusick |
Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag so that it is visible to userland programs. This change enables the `mount' command with no arguments to be able to show if a filesystem is mounted using journaled soft updates as opposed to just normal soft updates.
Approved by: re (bz)
|
224272 |
22-Jul-2011 |
mckusick |
Default debugging error messages to off for journaled soft updates sysctls. Delete limiting on output of these sysctls.
Approved by: re (kib)
|
224061 |
15-Jul-2011 |
mckusick |
Add an FFS specific mount option to allow a filesystem checker (typically fsck_ffs) to register that it wishes to use FFS specific sysctl's to update the filesystem. This ensures that two checkers cannot run on a given filesystem at the same time and that no other process accidentally or maliciously uses the filesystem updating sysctls inappropriately. This functionality is needed by the journaling soft-updates recovery code.
|
224027 |
14-Jul-2011 |
mckusick |
Consistently check mount flag (MNTK_SUJ) rather than superblock flag (FS_SUJ) when determining whether to do journaling-based operations. The mount flag is set only when journaling is active while the superblock flag is set to indicate that journaling is to be used. For example, when the filesystem is mounted read-only, the journaling may be present (FS_SUJ) but not active (MNTK_SUJ). Inappropriate checking of the FS_SUJ flag was causing some journaling actions to be attempted at inappropriate times.
|
223902 |
10-Jul-2011 |
mckusick |
When first creating snapshots, we may free some blocks within it. These blocks should not have TRIM applied to them.
Submitted by: Kostik Belousov
|
223900 |
10-Jul-2011 |
mckusick |
Allow disk partitions associated with UFS read-only mounted filesystems to be opened for writing. This functionality used to be special-cased for just the root filesystem, but with this change is now available for all UFS filesystems. This change is needed for journaled soft updates recovery.
Discussed with: Jeff Roberson
|
223888 |
09-Jul-2011 |
kib |
Use 'curthread_pflags' instead of 'thread_pflags' to signify that only curthread can be operated upon.
Requested by: attilio MFC after: 1 week
|
223887 |
09-Jul-2011 |
kib |
Use helper functions instead of manually managing TDP_INBDFLUSH.
Sponsored by: The FreeBSD Foundation Reviewed by: alc (previous version) MFC after: 1 week
|
223772 |
04-Jul-2011 |
jeff |
- Speed up pendingblock processing again. Having too much delay between ffs_blkfree() and the pending adjustment causes all kinds of space related problems.
|
223771 |
04-Jul-2011 |
jeff |
- Handle D_JSEGDEP in the softdep_sync_buf() switch. These can now find themselves on snapshot vnodes.
Reported by: pho
|
223770 |
04-Jul-2011 |
jeff |
- It is impossible to run request_cleanup() while doing a copyonwrite. This will most likely cause new block allocations which can recurse into request cleanup.
- While here optimize the ufs locking slightly. We need only acquire and drop once.
- process_removes() and process_truncates() are also only needed once.
- Attempt to flush each item on the worklist once but do not loop forever if some cannot be completed.
Discussed with: mckusick
|
223687 |
29-Jun-2011 |
mckusick |
Handle the FREEDEP case in softdep_sync_buf(). This fix failed to get added in -r223325.
Submitted by: Peter Holm
|
223677 |
29-Jun-2011 |
alc |
Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages.
This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages.
Update all of the existing assertions on pmap_remove_all() to reflect this change.
Reviewed by: kib
|
223325 |
20-Jun-2011 |
jeff |
- Fix directory count rollbacks by passing the mode to the journal dep earlier.
- Add rollback/forward code for frag and cluster accounting.
- Handle the FREEDEP case in softdep_sync_buf(). (submitted by pho)
|
223268 |
18-Jun-2011 |
mckusick |
Fixed dereference of a NULL pointer.
Reported by: Peter Holm
|
223169 |
16-Jun-2011 |
mckusick |
Drop the include of <ufs/ffs/ffs_extern.h> from usr.sbin/makefs/ffs/ffs_bswap.c and usr.sbin/makefs/ffs/ffs_subr.c as they have no need of anything in that file. No other programs or libraries include <ufs/ffs/ffs_extern.h> (nor should they as it is totally in-kernel interfaces). For added protection I enclosed the entire contents of <ufs/ffs/ffs_extern.h> in ifdef _KERNEL.
Feedback from: Bruce Evans and Tai-hwa Liang
|
223138 |
16-Jun-2011 |
avatar |
Fixing compilation bustage by introducing another forward declaration.
|
223127 |
15-Jun-2011 |
mckusick |
Ensure that filesystem metadata contained within persistent snapshots is always kept consistent.
Suggested by: Jeff Roberson
|
223114 |
15-Jun-2011 |
mckusick |
With the restructuring of the block reclamation code, the notification messages for a filesystem being out of space need to be moved so that they do not print out until after a failed cleanup attempt.
Suggested by: Jeff Roberson
|
223105 |
15-Jun-2011 |
mckusick |
Missing cleanup case after completion of a snapshot vnode write claiming a released block.
Submitted by: Jeff Roberson Tested by: Peter Holm
|
223052 |
13-Jun-2011 |
dim |
Use alternative, less messy solution to avoid breakage after r223020: put the snapdata structure between #ifdef _KERNEL guards.
Suggested by: kib
|
223020 |
12-Jun-2011 |
mckusick |
Update to soft updates journaling to properly track freed blocks that get claimed by snapshots.
Submitted by: Jeff Roberson Tested by: Peter Holm
|
223018 |
12-Jun-2011 |
mckusick |
Disable the soft updates journaling after a filesystem is successfully downgraded to read-only. It will be restarted if the filesystem is upgraded back to read-write.
|
222958 |
10-Jun-2011 |
jeff |
Implement fully asynchronous partial truncation with softupdates journaling to resolve errors which can cause corruption on recovery with the old synchronous mechanism.
- Append partial truncation freework structures to indirdeps while truncation is proceeding. These prevent new block pointers from becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it is only implemented in one place.
- Block allocation failure handling moved up one level so it does not proceed with buf locks held. This permits us to do more extensive reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes once at the start of ffs_syncvnode() and flushes truncations and inode dependencies. The second is called on each locked buf. This eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles acquiring vnode locks for handle_workitem_remove() so that it works more generally and does not loop excessively over the same worklist items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only done for regular files.
- Push a fsync complete record for files that need it so the checker knows a truncation in the journal is no longer valid.
Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts) Tested by: pho
|
222724 |
05-Jun-2011 |
mckusick |
Grammar fix in comment.
Eliminate one (of several) possible conflicting buffer locks when trying to reclaim blocks. Rest of fix to be incorporated as part of SUJ update by jeff.
Pointed out by: Kostik Belousov
|
222422 |
28-May-2011 |
mckusick |
Due to a lag in updating the fs_pendinginodes count, we cannot depend on it to decide whether we should try to reclaim inodes when we run short.
Discovered by: Peter Holm
|
222334 |
26-May-2011 |
mckusick |
The check for whether a block is going to be claimed by a snapshot needs to happen before we notify the underlying layer that it is being freed.
|
222167 |
22-May-2011 |
rmacklem |
Add a lock flags argument to the VFS_FHTOVP() file system method, so that callers can indicate the minimum vnode locking requirement. This will allow some file systems to choose to return a LK_SHARED locked vnode when LK_SHARED is specified for the flags argument. This patch only adds the flag. It does not change any file system to use it and all callers specify LK_EXCLUSIVE, so file system semantics are not changed.
Reviewed by: kib
|
221829 |
13-May-2011 |
mdf |
Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here.
Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
|
221281 |
30-Apr-2011 |
kib |
Fix typos.
Noted by: Fabian Keil <freebsd-listen fabiankeil de> Pointy hat to: kib MFC after: 1 week
|
221261 |
30-Apr-2011 |
kib |
Clarify the comment.
MFC after: 1 week
|
220985 |
24-Apr-2011 |
kib |
VFS sometimes is unable to inactivate a vnode when vnode use count goes to zero. E.g., the vnode might be only shared-locked at the time of vput() call. Such vnodes are kept in the hash, so they can be found later.
If ffs_valloc() allocated an inode that has its vnode cached in hash, and still owing the inactivation, then vget() call from ffs_valloc() clears VI_OWEINACT, and then the vnode is reused for the newly allocated inode.
The problem is, the vnode is not reclaimed before it is put to the new use. ffs_valloc() recycles vnode vm object, but this is not enough. In particular, at least v_vflag should be cleared, and several bits of UFS state need to be removed.
It is very inconvenient to call vgone() at this point. Instead, move some parts of ufs_reclaim() into helper function ufs_prepare_reclaim(), and call the helper from VOP_RECLAIM and ffs_valloc().
Reviewed by: mckusick Tested by: pho MFC after: 3 weeks
|
220532 |
11-Apr-2011 |
jeff |
- Refactor softdep_setup_freeblocks() into a set of functions to prepare for a new journal specific partial truncate routine.
- Use dep_current[] in place of specific dependency counts. This is automatically maintained when workitems are allocated and has less risk of becoming incorrect.
|
220511 |
10-Apr-2011 |
jeff |
Fix a long standing SUJ performance problem:
- Keep a hash of indirect blocks that have recently been freed and are still referenced in the journal.
- Lookup blocks in this hash before forcing a new block write to wait on the journal entry to hit the disk. This is only necessary to avoid confusion between old identities as indirects and new identities as file blocks.
- Don't free jseg structures until the journal has written a record that invalidates it. This keeps the indirect block information around for as long as is required to be safe.
- Force an empty journal block write when required to flush out stale journal data that is simply waiting for the oldest valid sequence number to advance beyond it.
|
220406 |
07-Apr-2011 |
jeff |
- Don't invalidate jnewblks immediately upon discovering that the block will be removed. Permit the journal to proceed so that we don't leave a rollback in a cg for a very long time as this can cause terrible perf problems in low memory situations.
Tested by: pho
|
220374 |
05-Apr-2011 |
mckusick |
Be far more persistent in reclaiming blocks and inodes before giving up and declaring a filesystem out of space. Especially necessary when running on a small filesystem. With this improvement, it should be possible to use soft updates on a small root filesystem.
Kudos to: Peter Holm Testing by: Peter Holm MFC: 2 weeks
|
220282 |
02-Apr-2011 |
jeff |
Fix problems that manifested from filesystem full conditions:
- In softdep_revert_mkdir() find the dotaddref before we attempt to cancel the jaddref so we can make assumptions about where the dotaddref is on the list. cancel_jaddref() does not always remove items from the list anymore.
- Always set GOINGAWAY on an inode in softdep_freefile() if DEPCOMPLETE was never set. This ensures that dependencies will continue to be processed on the inowait/bufwait list and is more an artifact of the structure of the code than a pure ordering problem.
- Always set DEPCOMPLETE on canceled jaddrefs so that they can be freed appropriately. This normally occurs when the refs are added to the journal but if they are canceled before this point the state would never be set and the dependency could never be freed.
Reported by: pho Tested by: pho
|
220099 |
28-Mar-2011 |
kib |
Fix the softdep_request_cleanup() function definition for !SOFTUPDATES case.
Submitted by: Aleksandr Rybalko <ray dlink ua>
|
219895 |
23-Mar-2011 |
mckusick |
Add retry code analogous to the block allocation retry code to avoid running out of inodes.
Reported by: Peter Holm
|
219804 |
20-Mar-2011 |
kib |
Retire opt_ffs_broken_fixme.h. Instead of directly calling ffs_snapgone(), use UFS_SNAPGONE() with usual layering.
Requested by: bde MFC after: 1 week
|
219276 |
04-Mar-2011 |
jhb |
Use ffs() to locate free bits in the inode bitmap rather than a loop with bit shifts.
Reviewed by: mckusick MFC after: 1 month
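The idea in this entry can be sketched in userland as follows, assuming a map byte in which a set bit means "in use" and a clear bit means "free". Both helper names are hypothetical, and this is not the code that was committed; it only contrasts a shift loop with the POSIX ffs() call from <strings.h>.

```c
#include <strings.h>

/* Loop version: test each bit in turn until a clear (free) one is found. */
static int
ex_free_bit_loop(unsigned char byte)
{
	for (int i = 0; i < 8; i++)
		if ((byte & (1 << i)) == 0)
			return (i);
	return (-1);
}

/*
 * ffs() version: invert the byte so free bits become set, mask to
 * 8 bits, and let ffs() find the lowest set bit. ffs() is 1-based
 * and returns 0 when no bit is set, so subtracting 1 yields -1 for
 * a full byte.
 */
static int
ex_free_bit_ffs(unsigned char byte)
{
	return (ffs(~byte & 0xff) - 1);
}
```

Both helpers return the same index for every possible byte value; the ffs() form simply removes the explicit loop.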
|
218602 |
12-Feb-2011 |
kib |
Use the native sector size of the device backing the UFS volume for SU+J journal blocks, instead of hard coding a 512 byte sector size. The journal needs to write the block atomically, which can only be guaranteed at the device sector size, not larger. An attempt to write less than the sector size results in driver errors.
Note that this is the first structure in UFS that depends on the sector size. Other elements are written in the units of fragments.
In collaboration with: pho Reviewed by: jeff Tested by: bz, pho
|
218485 |
09-Feb-2011 |
netchild |
Add some FEATURE macros for some UFS features.
SU+J is not included as a FEATURE macro:
- it was not in the tree during the GSoC
- I do not see an option to en-/disable it in NOTES
Two minor changes were made during the review compared to what was developed during GSoC 2010.
No FreeBSD version bump, the userland application to query the features will be committed last and can serve as an indication of the availability if needed.
Sponsored by: Google Summer of Code 2010 Submitted by: kibab Reviewed by: kib X-MFC after: to be determined in last commit with code from this project
|
218424 |
08-Feb-2011 |
mdf |
Based on discussions on the svn-src mailing list, rework r218195:
- entirely eliminate some calls to uio_yield() as being unnecessary, such as in a sysctl handler.
- move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h
- add a slightly more generic kern_yield() that can replace the functionality of uio_yield().
- replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching.
- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.
- instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
|
218195 |
02-Feb-2011 |
mdf |
Put the general logic for being a CPU hog into a new function should_yield(). Use this in various places. Encapsulate the common case of check-and-yield into a new function maybe_yield().
Change several checks for a magic number of iterations to use should_yield() instead.
MFC after: 1 week
|
217326 |
12-Jan-2011 |
mdf |
sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the kernel changes.
|
216951 |
04-Jan-2011 |
kib |
Instead of incrementing freework reference counter in indir_trunc(), do it at the allocation time for journaled fs and indirect blocks, when the allocated object is not accessible outside.
Requested and reviewed by: jeff Tested by: pho
|
216818 |
30-Dec-2010 |
kib |
Handle missing jremrefs when a directory is renamed overtop of another, deleting it. If the directory is removed, UFS always need to remove the .. ref, even if the ultimate ref on the parent would not change. The new directory must have a new journal entry for that ref. Otherwise journal processing would not properly account for the parent's reference since it will belong to a removed directory entry.
Change ufs_rename()'s dotdot rename section to always setup_dotdot_link(). In the tip != NULL case SUJ needs the newref dependency allocated via setup_dotdot_link().
Stop setting isrmdir to 2 for newdirrem() in softdep_setup_remove(). Remove the isdirrem > 1 checks from newdirrem().
Reported by: many Submitted by: jeff Tested by: pho
|
216817 |
30-Dec-2010 |
kib |
In indir_trunc(), when processing jnewblk entries that are not written to the disk, recurse to handle indirect blocks of next level that are hidden by the corresponding entry.
In collaboration with: pho Reviewed by: jeff, mckusick Tested by: mckusick, pho
|
216796 |
29-Dec-2010 |
kib |
Add kernel side support for BIO_DELETE/TRIM on UFS.
The FS_TRIM fs flag indicates that administrator requested issuing of TRIM commands for the volume. UFS will only send the command to disk if the disk reports GEOM::candelete attribute.
Since disk queue is reordered, data block is marked as free in the bitmap only after TRIM command completed. Due to need to sleep waiting for i/o to finish, TRIM bio_done routine schedules taskqueue to set the bitmap bit.
Based on the patch by: mckusick Reviewed by: mckusick, pjd Tested by: pho MFC after: 1 month
|
216795 |
29-Dec-2010 |
kib |
Move the definition of mkdirlisthd from header to C file.
Reviewed by: mckusick Tested by: pho
|
216676 |
23-Dec-2010 |
mckusick |
This patch fixes a soft update panic while running perl 5.12 tests which produced:
panic: indir_trunc: Index out of range -148 parent -2061 lbn -305164
Reported by: Dimitry Andric Fixed by: Jeff Roberson
|
216099 |
01-Dec-2010 |
kib |
Journal start looks up .sujournal file by doing lookup on the root dvp. As a result, failed softdep_mount() might leave up to two vnodes on the mp mountlist, preventing mnt_ref from going to zero.
Call ffs_flushfiles() after failed softdep_mount() to clean mountlist.
Initial report by: Garrett Cooper Reproduced and tested by: pho
|
215950 |
27-Nov-2010 |
pho |
First step in fixing the handle_workitem_freeblocks panic.
In collaboration with: kib
|
215576 |
20-Nov-2010 |
mckusick |
Delete /sys/ufs/ffs/README.snapshot as it is no longer relevant. Drop reference to it in mount(8).
MFC: 3 days
|
215117 |
11-Nov-2010 |
kib |
The softdep_setup_freeblocks() adds worklist items before deallocate_dependencies() is done. This opens a race between softdep thread and the thread that does the truncation: A write of the indirect block causes the freeblks to become ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And then, softdep_disk_write_complete() would reassign the workitem to the mount point worklist, causing premature processing of the workitem, or journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does the same reassign. indir_trunc() then would find the indirect block that is locked (with lock owned by kernel) but without any dependencies, causing it to hang in getblk() waiting for buffer lock.
Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies() finished.
Analyzed, suggested and reviewed by: jeff Tested by: pho
|
215115 |
11-Nov-2010 |
kib |
Change #ifdef INVARIANTS panic into KASSERT, and print some useful information to diagnose the issue, in handle_complete_freeblocks().
Reviewed by: jeff Tested by: pho
|
215114 |
11-Nov-2010 |
kib |
In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped. I believe there is a window otherwise where jblocks can be accessed without proper initialization.
Reviewed by: jeff Tested by: pho
|
215113 |
11-Nov-2010 |
kib |
Add function lbn_offset to calculate offset of the indirect block of given level.
Reviewed by: jeff Tested by: pho
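The log does not show the function body. One plausible reading, sketched here purely as an assumption, is that the offset for a given indirection level is the number of block pointers per indirect block raised to that level. The name ex_lbn_offset() and the explicit nindir parameter are hypothetical; the real function may take different arguments and compute something different.

```c
#include <stdint.h>

/*
 * Assumed model: each indirect block holds nindir pointers, so one
 * entry at the given level spans nindir^level logical blocks.
 * Level 0 therefore spans a single block.
 */
static int64_t
ex_lbn_offset(int64_t nindir, int level)
{
	int64_t res;

	for (res = 1; level > 0; level--)
		res *= nindir;
	return (res);
}
```

With 2048 pointers per indirect block, this model gives 1, 2048, and 4194304 blocks for levels 0, 1, and 2.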
|
215112 |
11-Nov-2010 |
kib |
Fix typo. Function is called ffs_blkfree.
|
213664 |
10-Oct-2010 |
kib |
The r184588 changed the layout of struct export_args, causing an ABI breakage for old mount(2) syscall, since most struct <filesystem>_args embed export_args. The mount(2) is supposed to provide ABI compatibility for pre-nmount mount(8) binaries, so restore ABI to pre-r184588.
Requested and reviewed by: bde MFC after: 2 weeks
|
213363 |
02-Oct-2010 |
alc |
M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that have no run-time effect.
|
213275 |
29-Sep-2010 |
mckusick |
Since local variable 'i' is used only in a KASSERT, declare and initialize it only if INVARIANTS is defined to avoid a declared but unused warning.
Suggested by: Brian Somers <brian@FreeBSD.org>
|
213259 |
29-Sep-2010 |
kib |
Fix typo in comment.
|
212788 |
17-Sep-2010 |
obrien |
Correct some non-code typos.
|
212617 |
14-Sep-2010 |
mckusick |
Update comments in soft updates code to more fully describe the addition of journalling. Only functional change is to tighten a KASSERT.
Reviewed by: jeff Roberson
|
211531 |
20-Aug-2010 |
jhb |
Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and LK_CANRECURSE after a lock is created. Use them to implement macros that otherwise manipulated the flags directly. Assert that the associated lockmgr lock is exclusively locked by the current thread when manipulating these flags to ensure the flag updates are safe. This last change required some minor shuffling in a few filesystems to exclusively lock a brand new vnode slightly earlier.
Reviewed by: kib MFC after: 3 days
|
211212 |
12-Aug-2010 |
kib |
Softdep_process_worklist() should unsuspend not only before processing the worklist (in softdep_process_journal), but also after flushing the workitems. Might be, we should even do this before bwillwrite() too, but this seems to be not needed for now.
Fs might be suspended during processing the queue, and then there is nobody around to unsuspend.
In collaboration with: pho Tested by: bz Reviewed by: jeff
|
210172 |
16-Jul-2010 |
jhb |
Revert the previous commit. The race is not applicable to the lockmgr implementation in 8.0 and later as its flags field does not hold dynamic state such as waiters flags, but is only modified in lockinit() aside from VN_LOCK_*().
Discussed with: attilio
|
210171 |
16-Jul-2010 |
jhb |
When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE in the vnode lock's flags) until after they had determined if the vnode was a FIFO. This occurs after the vnode has been inserted a VFS hash or some similar table, so it is possible for another thread to find this vnode via vget() on an i-node number and block on the vnode lock. If the lockmgr interlock (vnode interlock for vnode locks) is not held when clearing the LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result the thread blocked on the vnode lock may never get woken up. Fix this by holding the vnode interlock while modifying the lock flags in this case.
MFC after: 3 days
|
209717 |
06-Jul-2010 |
jeff |
- Handle the truncation of an inode with an effective link count of 0 in the context of the process that reduced the effective count. Previously all truncation as a result of unlink happened in the softdep flush thread. This had the effect of being impossible to rate limit properly with the journal code. Now the process issuing unlinks is suspended when the journal fills. This has a side-effect of improving rm performance by allowing more concurrent work.
- Handle two cases in inactive, one for effnlink == 0 and another when nlink finally reaches 0.
- Eliminate the SPACECOUNTED related code since the truncation is no longer delayed.
Discussed with: mckusick
|
209057 |
11-Jun-2010 |
avg |
ffs_softdep: change K&R in function definitions to ANSI prototypes
Apparently it's bad when we first have an ANSI prototype in function declaration, but then use K&R in its definition.
Complaint from: clang MFC after: 2 weeks
|
208293 |
19-May-2010 |
avg |
ffs_mount: accept and drop userland-only options that can be passed from loader(8)
In r193192 loader(8) has grown an ability to pass root mount options from fstab via vfs.root.mountfrom.options. Unfortunately, some options that can be present in fstab are for userland only and lead to root mounting failure when seen by kernel. Rather than teaching loader about FFS-specific options that should be filtered out, ffs_mount recognizes those options as valid, but ignores and deletes[1] them.
[1] is suggested by jh.
PR: kern/141050 Reported by: many Reviewed by: jh, bde MFC after: 4 days
|
208287 |
19-May-2010 |
jeff |
- Don't immediately re-run softdepflush if we didn't make any progress on the last iteration. This can lead to a deadlock when we have worklist items that cannot be immediately satisfied.
Reported by: uqs, Dimitry Andric <dimitry@andric.com>
- Remove some unnecessary debugging code and place some other under SUJ_DEBUG.
- Examine the journal state in softdep_slowdown().
- Re-format some comments so I may more easily add flag descriptions.
|
207742 |
07-May-2010 |
jeff |
- Call softdep_prealloc() before any of the balloc routines in the snapshot code.
- Don't fsync() vnodes in prealloc if copy on write is in progress. It is not safe to recurse back into the write path here.
Reported by: Vladimir Grebenschikov <vova@fbsd.ru>
|
207741 |
07-May-2010 |
jeff |
- Use the correct flag mask when determining whether an inode has successfully made it to the free list yet or not. This fixes a deadlock that can occur with unlinked but referenced files. Journal space and inodedeps were not correctly reclaimed because the inode block was not left dirty.
Tested/Reported by: lwindschuh@googlemail.com
|
207728 |
06-May-2010 |
alc |
Eliminate page queues locking around most calls to vm_page_free().
|
207669 |
05-May-2010 |
alc |
Acquire the page lock around all remaining calls to vm_page_free() on managed pages that didn't already have that lock held. (Freeing an unmanaged page, such as the various pmaps use, doesn't require the page lock.)
This allows a change in vm_page_remove()'s locking requirements. It now expects the page lock to be held instead of the page queues lock. Consequently, the page queues lock is no longer required at all by callers to vm_page_rename().
Discussed with: kib
|
207662 |
05-May-2010 |
trasz |
Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().
Reviewed by: kib
|
207366 |
29-Apr-2010 |
avg |
ffs_vfsops: restore alphabetic order of options in ffs_opts
The order was not correct only for nfsv4acls. ("no" prefix is ignored)
MFC after: 1 week
|
207310 |
28-Apr-2010 |
jeff |
- When canceling jaddrefs they may not yet be in the journal if this is via a revert call. In this case don't attempt to remove something that has not yet been added. Otherwise this jaddref must hang around to prevent the bitmap write as normal.
|
207309 |
28-Apr-2010 |
jeff |
- Fix builds without SOFTUPDATES defined in the kernel config.
|
207142 |
24-Apr-2010 |
pjd |
Fix build for UFS without SOFTUPDATES.
|
207141 |
24-Apr-2010 |
jeff |
- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown.
Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm
|
206128 |
03-Apr-2010 |
avg |
ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj
The assignment is already done in g_vfs_open. Redundant assignment is harmless, but can become a problem if g_vfs_open logic is changed.
MFC after: 1 week
|
203818 |
13-Feb-2010 |
kib |
When ffs_realloccg() failed to allocate bigger fragment and, because pending blocks are scheduled for removal, goes to retry the (re)allocation, clear the bp pointer. It might happen that meantime free space is really exhausted and we are entering nospace: label without bread()ing buffer, causing stale bp value to be brelse()d again.
Tested by: pho (Producing a scenario to reliably reproduce the race appeared to be much harder than fixing the bug) MFC after: 1 week
|
203784 |
11-Feb-2010 |
mckusick |
One last pass to get all the unsigned comparisons correct.
|
203763 |
10-Feb-2010 |
mckusick |
This fix corrects a problem in the file system that treats large inode numbers as negative rather than unsigned. For a default (16K block) file system, this bug began to show up at a file system size above about 16Tb.
To fully handle this problem, newfs must be updated to ensure that it will never create a filesystem with more than 2^32 inodes. That patch will be forthcoming soon.
Reported by: Scott Burns, John Kilburg, Bruce Evans Followup by: Jeff Roberson PR: 133980 MFC after: 2 weeks
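The signed-versus-unsigned problem described in this entry can be illustrated with a minimal sketch. It is not the FFS code: the function names are hypothetical, and the 32-bit inode number width is an assumption taken from the 2^32 limit mentioned above. It only shows why a range check on an inode number held in a signed type starts failing once the top bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check with the inode number held in a signed 32-bit type. */
static bool ino_valid_signed(int32_t ino, uint32_t ninodes)
{
    /* Any inode number with the top bit set is negative here and is rejected. */
    return ino >= 0 && (uint32_t)ino < ninodes;
}

/* The same check with the inode number treated as unsigned. */
static bool ino_valid_unsigned(uint32_t ino, uint32_t ninodes)
{
    return ino < ninodes;
}

/* Small self-check documenting the behaviour the sketch is meant to show. */
static void ino_selfcheck(void)
{
    uint32_t ninodes = UINT32_MAX;
    uint32_t big = UINT32_C(0x80000000);    /* first value with the top bit set */

    assert(ino_valid_unsigned(big, ninodes));
    /* The conversion to int32_t is implementation-defined; on common
     * two's-complement platforms it yields a negative value. */
    assert(!ino_valid_signed((int32_t)big, ninodes));
    assert(ino_valid_signed(2, ninodes) && ino_valid_unsigned(2, ninodes));
}
```

If one assumes an inode density of one inode per 8 KB of space (an assumption, not stated in the entry), inode number 2^31 corresponds to 2^44 bytes, which would be consistent with the roughly 16 TB threshold quoted above.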
|
203761 |
10-Feb-2010 |
trasz |
Remove unused variable.
|
202125 |
11-Jan-2010 |
mckusick |
Cast 64-bit quantity to intptr_t rather than int so as to work properly with 64-bit architectures (such as amd64).
Reported by: bz
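A minimal sketch of why the cast target matters, assuming a platform where int is 32 bits and intptr_t is 64 bits (as on amd64). The helper names are hypothetical and are not taken from the changed code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: pass a 64-bit quantity through each cast and back. */
static int64_t roundtrip_int(int64_t v)
{
    /* Out-of-range conversion to int is implementation-defined;
     * common compilers keep only the low 32 bits. */
    return (int64_t)(int)v;
}

static int64_t roundtrip_intptr(int64_t v)
{
    return (int64_t)(intptr_t)v;
}

/* Self-check guarded by the size assumptions stated above. */
static void cast_selfcheck(void)
{
    int64_t v = INT64_C(0x100000001);   /* needs more than 32 bits */

    if (sizeof(int) < sizeof(int64_t))
        assert(roundtrip_int(v) != v);      /* upper bits are lost */
    if (sizeof(intptr_t) == sizeof(int64_t))
        assert(roundtrip_intptr(v) == v);   /* value is preserved */
}
```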
|
202113 |
11-Jan-2010 |
mckusick |
Background:
When renaming a directory it passes through several intermediate states. First its new name will be created causing it to have two names (from possibly different parents). Next, if it has different parents, its value of ".." will be changed from pointing to the old parent to pointing to the new parent. Concurrently, its old name will be removed bringing it back into a consistent state. When fsck encounters an extra name for a directory, it offers to remove the "extraneous hard link"; when it finds that the names have been changed but the update to ".." has not happened, it offers to rewrite ".." to point at the correct parent. Both of these changes were considered unexpected so would cause fsck in preen mode or fsck in background mode to fail with the need to run fsck manually to fix these problems. Fsck running in preen mode or background mode now corrects these expected inconsistencies that arise during directory rename. The functionality added with this update is used by fsck running in background mode to make these fixes.
Solution:
This update adds three new fsck sysctl commands to support background fsck in correcting expected inconsistencies that arise from incomplete directory rename operations. They are:
setcwd(dirinode) - set the current directory to dirinode in the filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".." in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated with nameptr in the current directory is oldvalue then unlink it.
As with all other fsck sysctls, these new ones may only be used by processes with appropriate privilege.
Reported by: jeff Security issues: rwatson
|
201758 |
07-Jan-2010 |
mbr |
Remove extraneous semicolons, no functional changes.
Submitted by: Marc Balmer <marc@msys.ch> MFC after: 1 week
|
200796 |
21-Dec-2009 |
trasz |
Implement NFSv4 ACL support for UFS.
Reviewed by: rwatson
|
200770 |
21-Dec-2009 |
kib |
The VI_OBJDIRTY vnode flag mirrors the state of the OBJ_MIGHTBEDIRTY vm object flag. Besides providing redundant information, the need to update both vnode and object flags causes more acquisitions of the vnode interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.
Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects.
Suggested and reviewed by: alc Tested by: pho MFC after: 3 weeks
|
196920 |
07-Sep-2009 |
kib |
insmntque_stddtr() clears vp->v_data and resets vp->v_op to dead_vnodeops before calling vgone(). Revert r189706 and corresponding part of the r186560.
Noted and reviewed by: tegge Approved by: des (pseudofs part) MFC after: 3 days
|
196888 |
06-Sep-2009 |
kib |
The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp, V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific mp data, causing ffs_vgetf() to access freed memory.
Busy mountpoint before dropping softdep lk.
Noted and reviewed by: tegge Tested by: pho MFC after: 1 week
|
196206 |
14-Aug-2009 |
kib |
When a UFS node is truncated to zero length, e.g. by an explicit truncate(2) call, or by being removed or truncated on open, either a new softupdate freeblks structure is allocated to track the freed blocks of the node, or the truncation is done synchronously when too many SU dependencies have accumulated. The decision does not take into account the allocated freeblks dependencies, allowing workloads that do a huge number of truncations to exhaust kernel memory.
Take the number of allocated freeblks into consideration for softdep_slowdown().
Reported by: pluknet gmail com Diagnosed and tested by: pho Approved by: re (rwatson) MFC after: 1 month
|
195294 |
02-Jul-2009 |
kib |
In vn_vget_ino() and its inline equivalents, mnt_ref() the mount point around the sequence that drops the vnode lock and then busies the mount point. Holding neither a locked vnode nor a direct reference to the mp allows a forced unmount to proceed, leaving the mp unmounted or reused.
Tested by: pho Reviewed by: jeff Approved by: re (kensmith) MFC after: 2 weeks
|
195265 |
01-Jul-2009 |
trasz |
Don't panic on attempt to set ACL on a block device file. This is just a part of kern/125613.
PR: kern/125613 Submitted by: Jaakko Heinonen <jh at saunalahti dot fi> Reviewed by: rwatson Approved by: re (kib)
|
195187 |
30-Jun-2009 |
kib |
For SU mounts, softdep_fsync() might drop vnode lock, allowing other threads to put dirty buffers on the vnode bufobj list. For regular files and synchronous fsync requests, check for the condition and restart the fsync vop if a new dirty buffer arrived.
Tested by: pho Approved by: re (kensmith) MFC after: 1 month
|
195186 |
30-Jun-2009 |
kib |
Softdep_fsync() may need to lock parent directory of the synced vnode. Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent mountpoint from being unmounted and freed while no vnodes are locked.
Tested by: pho Approved by: re (kensmith) MFC after: 1 month
|
193511 |
05-Jun-2009 |
rwatson |
Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
|
193307 |
02-Jun-2009 |
attilio |
Handle lock recursion differently by always checking against LO_RECURSABLE instead of the lock's own flag.
Tested by: pho
|
192260 |
17-May-2009 |
alc |
Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().
In collaboration with: tegge
|
191990 |
11-May-2009 |
attilio |
Remove the thread argument from the FSD (File-System Dependent) parts of the VFS. Now all the VFS_* functions and relating parts don't want the context as long as it always refers to curthread.
In some points, in particular when dealing with VOPs and functions living in the same namespace (eg. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP.
While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option.
The VFS KPI is heavily changed by this commit, so third-party modules need to be recompiled. Bump __FreeBSD_version in order to signal this.
|
190888 |
10-Apr-2009 |
rwatson |
Remove VOP_LEASE and supporting functions. This hasn't been used since the removal of NQNFS, but was left in in case it was required for NFSv4. Since our new NFSv4 client and server can't use it for their requirements, GC the old mechanism, as well as other unused lease-related code and interfaces.
Due to its impact on kernel programming and binary interfaces, this change should not be MFC'd.
Proposed by: jeff Reviewed by: jeff Discussed with: rmacklem, zach loafman @ isilon
|
190690 |
04-Apr-2009 |
kib |
When removing or renaming a snapshot, do not delve into request_cleanup(). The latter may need blocks from the underlying device that belong to normal files, which should not be locked while the snap lock is held.
Reported and tested by: pho MFC after: 1 month
|
190469 |
27-Mar-2009 |
kib |
Correct typo.
Noted by: kensmith
|
189878 |
16-Mar-2009 |
kib |
Fix two issues with bufdaemon, often causing the processes to hang in the "nbufkv" sleep.
First, the ffs background cg block write requests a new buffer for the shadow copy. When ffs_bufwrite() is called from the bufdaemon due to a buffer shortage, requesting the buffer deadlocks bufdaemon. Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request that getblk not block while allocating the buffer, and return failure instead. Add a flag argument to geteblk() to allow the flags to be passed to getblk(). Do not repeat the getnewbuf() call from geteblk() if buffer allocation failed and either GB_NOWAIT_BD is specified, or geteblk() is called from bufdaemon (or its helper, see below). In ffs_bufwrite(), fall back to a synchronous cg block write if shadow block allocation failed.
Since r107847, buffer write assumes that vnode owning the buffer is locked. The second problem is that buffer cache may accumulate many buffers belonging to limited number of vnodes. With such workload, quite often threads that own the mentioned vnodes locks are trying to read another block from the vnodes, and, due to buffer cache exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make any substantial progress because the vnodes are locked.
Allow the threads owning vnode locks to help the bufdaemon by doing the flush pass over the buffer cache before getnewbuf() is going to uninterruptible sleep. Move the flushing code from buf_daemon() to new helper function buf_do_flush(), that is called from getnewbuf(). The number of buffers flushed by single call to buf_do_flush() from getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent recursive calls to buf_do_flush() by marking the bufdaemon and threads that temporarily help bufdaemon by TDP_BUFNEED flag.
In collaboration with: pho Reviewed by: tegge (previous version) Tested by: glebius, yandex ... MFC after: 3 weeks
|
189737 |
12-Mar-2009 |
kib |
The non-modifying EA VOPs are executed with only shared vnode lock taken. Provide a custom lock around initializing and tearing down EA area, to prevent both memory leaks and double-free of it. Count the number of EA area accessors.
Lock protocol requires either holding exclusive vnode lock to modify i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag.
Noted by: YAMAMOTO, Taku <taku tackymt homeip net> Tested by: pho (previous version) MFC after: 2 weeks
|
189706 |
11-Mar-2009 |
kib |
Do not double-free the struct inode when insmntque failed. Default insmntque destructor reclaims the vnode, and ufs_reclaim frees the memory.
Reviewed by: tegge MFC after: 3 days
|
189696 |
11-Mar-2009 |
jhb |
Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a filesystem supports additional operations using shared vnode locks. Currently this is used to enable shared locks for open() and close() of read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a shared vnode lock for the leaf vnode only if the mount point has the extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since FIFO's require exclusive vnode locks for their open() and close() routines. (My recent MPSAFE patches for UDF and cd9660 already included this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.
Submitted by: ups Reviewed by: pjd (ZFS bits) MFC after: 1 month
|
189595 |
09-Mar-2009 |
jhb |
Adjust some variables (mostly related to the buffer cache) that hold address space sizes to be longs instead of ints. Specifically, the following values are now longs: runningbufspace, bufspace, maxbufspace, bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace, hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a relatively small number (~ 44000) of buffers set in kern.nbuf would result in integer overflows resulting either in hangs or bogus values of hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see such problems. There was a check for a nbuf setting that would cause overflows in the auto-tuning of nbuf. I've changed it to always check and cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things like top(1), etc., so any MFC would probably require some gross shims to allow for that.
MFC after: 1 month
|
188956 |
23-Feb-2009 |
trasz |
Right now, when trying to unmount a device that's already gone, msdosfs_unmount() and ffs_unmount() exit early after getting ENXIO. However, dounmount() treats ENXIO as a success and proceeds with unmounting. In effect, the filesystem gets unmounted without closing GEOM provider etc.
Reviewed by: kib Approved by: rwatson (mentor) Tested by: dho Sponsored by: FreeBSD Foundation
|
188954 |
23-Feb-2009 |
trasz |
Refactor, moving error checking outside of the 'if (mp->mnt_flag & MNT_SOFTDEP)' conditional. No functional changes.
Reviewed by: kib Approved by: rwatson (mentor) Tested by: pho Sponsored by: FreeBSD Foundation
|
188501 |
11-Feb-2009 |
jhb |
- If the g_access() call for the initial root mount fails, then clean up fully. Previously, the GEOM consumer would not have been closed.
- Bump the reference on the character device being mounted while the associated devfs vnode is locked.
Reviewed by: kib
|
188240 |
06-Feb-2009 |
trasz |
When a device containing a mounted UFS filesystem disappears, the type of devvp becomes VBAD, which UFS incorrectly interprets as a snapshot vnode, which in turn causes a panic. Fix it by replacing '!= VCHR' with '== VREG'.
With this fix in place, you should no longer be able to panic the system by removing a device with an UFS filesystem mounted from it - assuming you don't use softupdates.
Reviewed by: kib Tested by: pho Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation
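The difference between the two tests can be sketched as follows. The enum is a hypothetical stand-in loosely modelled on the kernel's vnode types, not the real definition, and the two predicates only mirror the '!= VCHR' and '== VREG' comparisons named in the entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of vnode types used only for this sketch. */
enum vtype_sketch { SK_VNON, SK_VREG, SK_VDIR, SK_VBLK, SK_VCHR, SK_VBAD };

/* Old test: everything that is not a character device counts as a snapshot. */
static bool looks_like_snapshot_old(enum vtype_sketch t)
{
    return t != SK_VCHR;
}

/* New test: only a regular file counts as a snapshot. */
static bool looks_like_snapshot_new(enum vtype_sketch t)
{
    return t == SK_VREG;
}

/* Self-check: the two tests disagree exactly where the entry says they do. */
static void vtype_selfcheck(void)
{
    assert(looks_like_snapshot_old(SK_VBAD));   /* vanished device misclassified */
    assert(!looks_like_snapshot_new(SK_VBAD));  /* no longer misclassified */
    assert(looks_like_snapshot_old(SK_VREG) && looks_like_snapshot_new(SK_VREG));
    assert(!looks_like_snapshot_old(SK_VCHR) && !looks_like_snapshot_new(SK_VCHR));
}
```

A positive test against the one expected type is the more defensive choice here, because any state that was not anticipated falls on the "not a snapshot" side.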
|
187894 |
29-Jan-2009 |
trasz |
Make sure the cdev doesn't go away while the filesystem is still mounted. Otherwise dev2udev() could return garbage.
Reviewed by: kib Approved by: rwatson (mentor) Sponsored by: FreeBSD Foundation
|
187790 |
27-Jan-2009 |
rwatson |
Following a fair amount of real world experience with ACLs and extended attributes since FreeBSD 5, make the following semantic changes:
- Don't update the inode modification time (mtime) when extended attributes (and hence also ACLs) are added, modified, or removed.
- Don't update the inode access time (atime) when extended attributes (and hence also ACLs) are queried.
This means that rsync (and related tools) won't improperly think that the data in the file has changed when only the ACL has changed.
Note that ffs_reallocblks() has not been changed to not update on an IO_EXT transaction, but currently EAs don't use the cluster write routines so this shouldn't be a problem. If EAs grow support for clustering, then VOP_REALLOCBLKS() will need to grow a flag argument to carry down IO_EXT to UFS.
MFC after: 1 week PR: ports/125739 Reported by: Alexander Zagrebin <alexz@visp.ru> Tested by: pluknet <pluknet@gmail.com>, Greg Byshenk <freebsd@byshenk.net> Discussed with: kib, kientzle, timur, Alexander Bokovoy <ab@samba.org>
|
187490 |
20-Jan-2009 |
kib |
The r187467 should remove all pages for V_NORMAL case too, because indirect block pages are not removed by the mentioned invocation of the vnode_pager_setsize().
Put a common code into the helper function ffs_pages_remove().
Reported and tested by: dchagin Reviewed by: ups MFC after: 3 weeks
|
187468 |
20-Jan-2009 |
kib |
When extending inode size, we call vnode_pager_setsize(), to have an address space where to put vnode pages, and then call UFS_BALLOC(), to actually allocate a new block and map it. When UFS_BALLOC() returns an error, sometimes we forget to revert the vm object size increase, allowing for pages that are not backed by logical disk blocks.
Revert vnode_pager_setsize() back when UFS_BALLOC() failed, for ffs_truncate() and ffs_write().
PR: 129956 Reviewed by: ups MFC after: 3 weeks
|
187467 |
20-Jan-2009 |
kib |
FFS puts the extended attributes blocks at the negative blocks for the vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it incorrectly does vm_object_page_remove(0, 0), removing all pages from the underlying vm object, not only the pages that back the extended attributes data.
Change vinvalbuf() to not remove any pages from the object when V_NORMAL or V_ALT are specified. Instead, the only in-tree caller in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitly removes the corresponding page range. The V_NORMAL caller does vnode_pager_setsize(vp, 0) immediately after the call to vinvalbuf(V_NORMAL) already.
Reported by: csjp Reviewed by: ups MFC after: 3 weeks
|
186897 |
08-Jan-2009 |
kib |
If unmount of the ffs mp failed, reinitialize the extended attributes for the mp, and restart them if autostart is enabled.
Reported and tested by: pho Reviewed by: rwatson MFC after: 3 weeks
|
184934 |
13-Nov-2008 |
ambrisko |
For now, flush the buffer cache after every 10 cylinder groups to free up space. If the buffer cache fills up, the disk systems can grind to a halt. Better tuning can be figured out later.
Tested by: Tim, others and work Reviewed by: Kostik Belousov PR: 128832
|
184554 |
02-Nov-2008 |
attilio |
Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now useless.
- Due to the change above, vfs_busy() is no longer linked to a lockmgr. Change its KPI accordingly by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT, which basically replaces the LK_NOWAIT of the old version (which was already unlinked from the lockmgr), and MBF_MNTLSTLOCK, which provides the ability to drop the mountlist_mtx once the mnt interlock is held (an ability still desired by most consumers).
- The stub used in vfs_mount_destroy(), which allows overriding the mnt_ref if running for more than 3 seconds, makes it totally useless. Remove it, as it was meant to work in older versions. If a problem of "refcount held never going away" should appear, we will need to fix it properly rather than trust such a hackish solution.
- Fix a bug where returning (with an error) from dounmount() still left the MNTK_MWAIT flag set even if the waiters were actually woken up. Only one place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.
This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly.
Discussed with: kib Tested by: pho
|
184413 |
28-Oct-2008 |
trasz |
Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit.
Approved by: rwatson (mentor)
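The width problem described in this entry can be sketched with stand-in types. The typedefs, the constant names, and the bit values below are all hypothetical; the only facts taken from the entry are that mode_t is 16 bits wide and that more V* constants are needed.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t mode16_sketch_t;   /* stand-in for a 16-bit mode_t */
typedef int accmode_sketch_t;       /* stand-in for a wider access-mode type */

#define SK_VREAD  0x0100            /* hypothetical bit that fits in 16 bits */
#define SK_VEXTRA 0x00010000        /* hypothetical new bit above bit 15 */

/* Store the access bits in the 16-bit type, then test for one bit. */
static int keeps_bit_mode16(int bits, int bit)
{
    mode16_sketch_t m = (mode16_sketch_t)bits;  /* bits above 15 are dropped */
    return (m & bit) != 0;
}

/* Store the access bits in the wider type, then test for one bit. */
static int keeps_bit_accmode(int bits, int bit)
{
    accmode_sketch_t m = bits;
    return (m & bit) != 0;
}

/* Self-check: only the wider type keeps the new bit. */
static void accmode_selfcheck(void)
{
    int bits = SK_VREAD | SK_VEXTRA;

    assert(keeps_bit_mode16(bits, SK_VREAD));
    assert(!keeps_bit_mode16(bits, SK_VEXTRA));
    assert(keeps_bit_accmode(bits, SK_VREAD));
    assert(keeps_bit_accmode(bits, SK_VEXTRA));
}
```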
|
184214 |
23-Oct-2008 |
des |
Fix a number of style issues in the MALLOC / FREE commit. I've tried to be careful not to fix anything that was already broken; the NFSv4 code is particularly bad in this respect.
|
184205 |
23-Oct-2008 |
des |
Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after: 3 months
|
184074 |
20-Oct-2008 |
kib |
Assert that v_holdcnt is non-zero before entering lockmgr in vn_lock and ffs_lock. This cannot catch situations where holdcnt is incremented not by curthread, but I think it is useful.
Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks
|
183822 |
13-Oct-2008 |
kib |
Sync up summary information for cylinder groups while data is already in memory during snapshot creation. This improves the results of the background fsck.
Submitted by: tegge MFC after: 1 week
|
183754 |
10-Oct-2008 |
attilio |
Remove the useless struct thread argument from the bufobj interface. In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()
and the BO_SYNC() "virtual method" of the buffer objects set. The main consumers of the bufobj functions are affected by this change too; in particular, the functions whose KPI changed are:
- vinvalbuf()
- g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later commit.
As a side note, please consider the passing of the 'curthread' argument to VOP_SYNC() (in bufsync()) as only temporary, as it will be axed ASAP.
Reviewed by: kib Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
|
183331 |
24-Sep-2008 |
jhb |
Enable shared lookups on UFS. There are some remaining issues with forced unmounts, but those are in the VFS lookup code and are not UFS specific.
Tested by: pho, kris
|
183074 |
16-Sep-2008 |
kib |
Suspend the write operations on the UFS filesystem being unmounted or remounted from rw to ro.
Proposed and reviewed by: tegge In collaboration with: pho MFC after: 1 month
|
183073 |
16-Sep-2008 |
kib |
When an attempt is made to suspend a filesystem that is already suspended, wait until the current suspension is lifted instead of silently returning success immediately. The consequences of calling vfs_write_resume() when not owning the suspension are not well-defined at best.
Add the vfs_susp_clean() mount method to be called from vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and stop calling it manually.
Add the thread flag TDP_IGNSUSP that allows to bypass the suspension point in the vn_start_write. It is intended for use by VFS in the situations where the suspender want to do some i/o requiring calls to vn_start_write(), and this i/o cannot be done later.
Reviewed by: tegge In collaboration with: pho MFC after: 1 month
|
183072 |
16-Sep-2008 |
kib |
Add the ffs structures introspection functions for ddb. Show the b_dep value for the buffer in the show buffer command. Add a command to dump the dirty/clean buffer list for a vnode.
Reviewed by: tegge Tested and used by: pho MFC after: 1 month
|
183070 |
16-Sep-2008 |
kib |
When downgrading the read-write mount to read-only, do_unmount() sets MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive() and ufs_itimes_locked(), UFS verifies whether the fs is read-only by checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag for inode on the fs being remounted rw->ro.
Introduce the UFS_RDONLY() struct ufsmount method that reports the value of fs_ronly. The latter is set to 1 only after the remount is finished.
Reviewed by: tegge In collaboration with: pho MFC after: 1 month
|
183067 |
16-Sep-2008 |
kib |
The struct inode *ip supplied to softdep_freefile is not necessarily the inode having number ino. In r170991, the ip was marked IN_MODIFIED, which is not quite correct.
Mark only the right inode modified by checking inode number.
Reviewed by: tegge In collaboration with: pho MFC after: 1 month
|
182721 |
03-Sep-2008 |
trasz |
When calling extattr_check_cred, use V{READ,WRITE}, not I{READ,WRITE}.
Approved by: rwatson (mentor)
|
182542 |
31-Aug-2008 |
attilio |
Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.
Manpages are updated accordingly.
Tested by: Diego Sardina <siarodx at gmail dot com>
|
182371 |
28-Aug-2008 |
attilio |
Decontextualize the VOP_GETATTR / VOP_SETATTR pair, as the passed thread was always curthread and totally useless.
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
|
182366 |
28-Aug-2008 |
kib |
In ffs_valloc(), ffs_vget() may fail because insmntque() refused to insert new vnode into the mount vnode list. Then, for the SU-enabled mount, ffs_vfree could create freefile dependency. This dependency can hang around forever since inode is not marked as IN_MODIFIED and correspondingly inodeblock may be not marked as dirty.
After ffs_vget() fails, retry with FFSV_FORCEINSMQ, mark the inode as modified, and vput() it immediately. Take care of the dup alloc.
Tested by: pho Reviewed by: tegge MFC after: 1 month
|
182365 |
28-Aug-2008 |
kib |
Softdep code may need to instantiate vnode when processing dependencies. In particular, it may need this while syncing filesystem being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set, that could sometimes disallow insertion of the vnode into the vnode mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag.
Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for new vnode and use it consistently from the softdep code instead of ffs_vget().
Add the retry logic to the softdep_flushfiles() to flush the vnodes that could be instantiated while flushing softdep dependencies.
Tested by: pho, kris Reviewed by: tegge MFC after: 1 month
|
181528 |
10-Aug-2008 |
kib |
Revert r181345. Move the NULL pointer check to the vfs_deleteopt() function.
Discussed with: rodrigc MFC after: 3 days
|
181345 |
06-Aug-2008 |
kib |
User may do "mount -o snapshot ...", that causes new FFS mount to be performed with snapshot option, while the mp->mnt_opt is NULL. Protect against NULL pointer dereference.
Noted by: Mateusz Guzik <mjguzik gmail com> MFC after: 3 days
|
180758 |
23-Jul-2008 |
kib |
The ffs_balloc_ufs{1,2} functions call bdwrite() while having several vnode buffers locked at once. In particular, there are indirect buffers among the locked ones. The bdwrite() may start flushing to keep the dirty buffer list within bounds. If any buffer on the dirty list requires translation from logical to physical block number, the code may end up trying to lock an indirect buffer already locked in ffs_balloc_ufsX.
Prevent the bdflush() activity when several buffers are locked at once by setting TDP_INBDFLUSH for the problematic code blocks.
Reported and tested by: pho, Josef Buchsteiner at Juniper In collaboration with: kan MFC after: 1 month
|
180621 |
19-Jul-2008 |
pjd |
Say hi to svn, by simplifying the ffs_vget() function a bit - there is no need for a variable that is used only once.
|
179295 |
24-May-2008 |
rodrigc |
Fix comments to replace SBSIZE with SBLOCKSIZE, since SBSIZE was renamed to SBLOCKSIZE in version 1.33
Reviewed by: mckusick
|
179270 |
24-May-2008 |
rodrigc |
After converting the "snapshot" mount option to the MNT_SNAPSHOT flag, delete "snapshot" from the persistent mount options list. This should fix problems with doing a mount -o snapshot of a file system, followed by an NFS export of the same file system.
PR: 122833 Reported by: Leon Kos <leon.kos lecad fs uni-lj si>, Jaakko Heinonen <jh saunalahti fi> MFC after: 1 month
|
179269 |
24-May-2008 |
rodrigc |
For the following mount options, do not perform the string to flag conversions here, because we already do them further up in vfs_donmount() in vfs_mount.c
async -> MNT_ASYNC
force -> MNT_FORCE
multilabel -> MNT_MULTILABEL
noatime -> MNT_NOATIME
noclusterr -> MNT_NOCLUSTERR
noclusterw -> MNT_NOCLUSTERW
MFC after: 1 month
|
177957 |
06-Apr-2008 |
attilio |
Optimize lockmgr in order to get rid of the pool mutex interlock, the state transitioning flags, and msleep(9) calls. Instead, use an algorithm very similar to what sx(9) and rwlock(9) already do, with direct access to the sleepqueue(9) primitive.
In order to avoid writer starvation, a mechanism very similar to what rwlock(9) uses now is implemented, with the corresponding per-thread shared lockmgrs counter.
This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and lockmgr_args_rw(). These two are like the 2 "normal" versions, but they both accept a rwlock as interlock. In order to realize this, the general lockmgr manager function "__lockmgr_args()" has been implemented through the generic lock layer. It supports all the blocking primitives, but currently only these 2 mappers live.
The patch drops WITNESS support for now, but it will probably be added back soon. Also, there is a small race in the draining code, which is also present in the current CVS stock implementation: if some sharers, once they wake up, are in the runqueue, they can contend for the lock with the exclusive drainer. This is hard to fix, but the newly committed code mitigates the issue much better than the (past) CVS version. In addition, the KA_HELD and KA_UNHELD assertions have been made mute because they are dangerous and will no longer be supported soon.
In order to avoid namespace pollution, stack.h is split into two parts: one which includes only the "struct stack" definition (_stack.h) and one defining the KPI. In this way, the newly added _lockmgr.h can just include _stack.h.
The kernel ABI is heavily changed by this commit (the newly committed version of "struct lock" is a lot smaller than the previous one) and the KPI is broken by the introduction of lockmgr_rw() / lockmgr_args_rw(), so manpages and __FreeBSD_version will be updated accordingly.
Tested by: kris, pho, jeff, danger Reviewed by: jeff Sponsored by: Google, Summer of Code program 2007
|
177785 |
31-Mar-2008 |
kib |
Add the support for the AT_FDCWD and fd-relative name lookups to the namei(9).
Based on the submission by rdivacky, sponsored by Google Summer of Code 2007 Reviewed by: rwatson, rdivacky Tested by: pho
|
177779 |
31-Mar-2008 |
jeff |
- Since rev 1.142 of ffs_snapshot.c the interlock has not been required to protect the v_lock pointer. Removing the interlock acquisition here allows vn_lock() to proceed without requiring the interlock at all.
- If the lock mutated while we were sleeping on it the interlock has been dropped. It is conceivable that the upper layer code was relying on the interlock and LK_NOWAIT to protect the identity or state of the vnode while acquiring the lock. In this case return EBUSY rather than trying the new lock to prevent potential races.
Reviewed by: tegge
|
177778 |
31-Mar-2008 |
jeff |
- Don't free snapdata structures when they are no longer in use. Keeping the lockmgr lock valid allows us to switch the v_lock pointer in snapshot vnodes between the embedded lockmgr lock and snapdata lock without needing the vnode interlock to protect against races.
- Keep unused snapdata structures in a list.
- Add a function to lock the devvp and allocate a snapdata to it or acquire a new one without races. The old function was safe from creation races because we set the mount flag when creating snapshots and thus serializing them. However, it might have been subject to destroying races.
Reviewed by: tegge
|
177645 |
26-Mar-2008 |
jhb |
Fix a nit with the 'nofoo' options where 'foo' is mapped to 'nonofoo' (such as 'atime' vs 'noatime'). The filesystems will always see either 'nofoo' or 'nonofoo', never plain 'foo'. As such, their list of valid mount options should include 'nofoo' instead of 'foo'. With this fix, you can do 'mount -u -o atime' on a FFS filesystem that isn't marked as noatime without getting an error. You can also update a noatime FFS filesystem mounted via mount(2) (e.g. 6.x /sbin/mount binary) to 'atime' using nmount(2) (e.g. 7.x /sbin/mount binary).
MFC after: 1 week Reviewed by: crodig
|
177528 |
23-Mar-2008 |
kib |
Yield the cpu in the kernel while iterating the list of the vnodes belonging to the mountpoint. Also, yield when in the softdep_process_worklist() even when we are not going to sleep due to buffer drain.
It is believed that the ULE fixed the problem [1], but the yielding seems to be needed at least for the 4BSD case.
Discussed: on stable@, with bde Reviewed by: tegge, jeff [1] MFC after: 2 weeks
|
177493 |
22-Mar-2008 |
jeff |
- Complete part of the unfinished bufobj work by consistently using BO_LOCK/UNLOCK/MTX when manipulating the bufobj. - Create a new lock in the bufobj to lock bufobj fields independently. This leaves the vnode interlock as an 'identity' lock while the bufobj is an io lock. The bufobj lock is ordered before the vnode interlock and also before the mnt ilock. - Exploit this new lock order to simplify softdep_check_suspend(). - A few sync related functions are marked with a new XXX to note that we may not properly interlock against a non-zero bv_cnt when attempting to sync all vnodes on a mountlist. I do not believe this race is important. If I'm wrong this will make these locations easier to find.
Reviewed by: kib (earlier diff) Tested by: kris, pho (earlier diff)
|
177474 |
21-Mar-2008 |
kib |
Reduce the acquisition of the vnode interlock in the ffs_read() and ffs_extread() when setting the IN_ACCESS flag by checking whether the IN_ACCESS is already set. The possible race there is admissible.
Tested by: pho Submitted by: jeff
|
177368 |
19-Mar-2008 |
jeff |
- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from requiring the per-process spinlock to only requiring the process lock. - Reflect these changes in the proc.h documentation and consumers throughout the kernel. This is a substantial reduction in locking cost for these fields and was made possible by recent changes to threading support.
|
177253 |
16-Mar-2008 |
rwatson |
In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr.
MFC after: 1 month Discussed with: imp, rink
|
177156 |
13-Mar-2008 |
cokane |
Replace the non-MPSAFE timeout(9) API in ffs_softdep.c with the MPSAFE callout_* API (e.g. callout_init_mtx(9)). This was one of the numerous items on the http://wiki.freebsd.org/SMPTODO list.
Reviewed by: imp, obrien, jhb MFC after: 1 week
|
177034 |
10-Mar-2008 |
emaste |
Remove include of opt_quota.h; as of revision 1.205 there is no longer any #ifdef QUOTA conditional code.
|
176831 |
05-Mar-2008 |
kib |
Initialize mnt_stat.f_iosize before autostarting UFS1 extattrs. It is normally initialized by ffs_statfs() after ffs_mount finished.
The extattr autostart code calls ufs_lookup(), which uses the value above to iterate over the directory blocks; see the bmask initialization in ufs_lookup() and ufsdirhash. A filesystem whose root directory spans more than one block would result in reading random kernel memory.
PR: kern/120781 Test case provided by: rwatson MFC after: 1 week
|
176795 |
04-Mar-2008 |
rwatson |
Move setting of MNTK_MPSAFE flag before UFS1 extended attribute auto-start so that the flag is set before we start performing I/O in the auto-start routine.
MFC after: 2 weeks Suggested by: kib
|
176564 |
25-Feb-2008 |
keramida |
Minor typo nit.
|
176559 |
25-Feb-2008 |
attilio |
Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is always curthread.
As KPI gets broken by this patch, manpages and __FreeBSD_version will be updated by further commits.
Tested by: Andrea Barberio <insomniac at slackware dot it>
|
176519 |
24-Feb-2008 |
attilio |
Introduce some functions in the vnode locks namespace and in the ffs namespace in order to handle lockmgr fields in a controlled way instead of spreading bogus stubs all around: - VN_LOCK_AREC() allows lock recursion for a specified vnode - VN_LOCK_ASHARE() allows lock sharing for a specified vnode
In FFS land: - BUF_AREC() allows lock recursion for a specified buffer lock - BUF_NOREC() disallows recursion for a specified buffer lock
Side note: union_subr.c::unionfs_node_update() is the only other function directly handling lockmgr fields. As this is not simple to fix, it has been left behind as "sole" exception.
|
176320 |
15-Feb-2008 |
attilio |
- Introduce lockmgr_args() in the lockmgr space. This function performs the same operation as lockmgr() but accepts a custom wmesg, prio and timo for the particular lock instance, overriding the default values lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo. - Use lockmgr_args() in order to implement BUF_TIMELOCK() - Cleanup BUF_LOCK() - Remove LK_INTERNAL as it is no longer used in the lockmgr namespace
Tested by: Andrea Barberio <insomniac at slackware dot it>
|
175635 |
24-Jan-2008 |
attilio |
Cleanup lockmgr interface and exported KPI: - Remove the "thread" argument from the lockmgr() function as it is always curthread now - Axe lockcount() function as it is no longer used - Axe LOCKMGR_ASSERT() as it is really bogus and not currently used. Hopefully this will soon be replaced by something suitable for it. - Remove the prototype for dumplockinfo() as the function is no longer present
Additionally: - Introduce a KASSERT() in lockstatus() in order to let it accept only curthread or NULL as they should only be passed - Do a little bit of style(9) cleanup on lockmgr.h
KPI results heavily broken by this change, so manpages and FreeBSD_version will be modified accordingly by further commits.
Tested by: matteo
|
175486 |
19-Jan-2008 |
attilio |
- Introduce the function lockmgr_recursed() which returns true if the lockmgr lkp, when held in exclusive mode, is recursed - Introduce the function BUF_RECURSED() which does the same for bufobj locks based on the top of lockmgr_recursed() - Introduce the function BUF_ISLOCKED() which works like the counterpart VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj
BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of the bogus BUF_REFCNT() in a more explicit and SMP-compliant way. This allows us to axe out BUF_REFCNT() and leave the function lockcount() totally unused in our stock kernel. Further commits will axe lockcount() as well, as part of the lockmgr() cleanup.
KPI results, obviously, broken so further commits will update manpages and freebsd version.
Tested by: kris (on UFS and NFS)
|
175294 |
13-Jan-2008 |
attilio |
VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the unneeded extra argument and pass curthread explicitly to lower layer functions, when necessary.
KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
|
175202 |
10-Jan-2008 |
attilio |
vn_lock() is currently only used with 'curthread' passed as argument. Remove this argument and pass curthread directly to the underlying VOP_LOCK1() VFS method. This modification makes the code cleaner and in particular removes an annoying dependence, helping the next lockmgr() cleanup. KPI results, obviously, changed.
Manpage and FreeBSD_version will be updated through further commits.
As a side note, it is worth saying that the next commits will address a similar cleanup of VFS methods, in particular vop_lock1 and vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
|
175068 |
03-Jan-2008 |
kib |
ffs_balloc_ufsX() routines, in the case of recovering from the failed allocation, free the indirect blocks before clearing the disk pointers, that could lead to the softupdate inconsistencies in the case of the machine or disk crash at the wrong time.
Rearrange the recover code to do the ffs_blkfree() after the second ffs_syncvnode(), that clears the pointers chain.
Proposed and reviewed by: tegge Tested by: Peter Holm MFC after: 3 weeks
|
175053 |
02-Jan-2008 |
obrien |
style(9)
|
174973 |
29-Dec-2007 |
kib |
The ffs_balloc() routines, when allocating the indirect blocks for the inode, do the rollback in case the allocation failed (due to insufficient free space or quota limits). But the code leaves the buffers corresponding to the indirect blocks on the vnode bufobj list. This causes several assertions (for instance, "ffs_truncate3" in ffs_truncate()) to fail, and could result in the indirect block aliasing problem, like writing the content of such blocks to a random disk location.
Remove the buffers from the bufobj properly.
Reported and tested by: Peter Holm Reviewed by: tegge MFC after: 3 weeks
|
174126 |
01-Dec-2007 |
kensmith |
Fix a broken check that recently became more annoying because it now gets enabled when INVARIANTS is on instead of DIAGNOSTIC (which apparently nobody uses). From Tor's description:
This happens when the block range spans two block maps, the first in the inode (mapping up to NDADDR direct blocks) and the second being the first indirect block. The current check assumes that both block maps are indirect blocks.
Work done by: tegge Tested by: kris, kensmith
|
173501 |
09-Nov-2007 |
ru |
Fix build without INVARIANTS and update a comment to match a change made in previous revision.
|
173464 |
08-Nov-2007 |
obrien |
Turn most ffs 'DIAGNOSTIC's into INVARIANTS.
|
172930 |
24-Oct-2007 |
rwatson |
Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms:
mac_<object>_<method/action> mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names.
All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
|
172836 |
20-Oct-2007 |
julian |
Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx as they actually make whole processes. This makes way for us to add REAL kthread_create() and friends that actually make threads. It turns out that most of these calls actually end up being moved back to the thread version when it's added, but we need to make this cosmetic change first.
I'd LOVE to do this rename in 7.0 so that we can eventually MFC the new kthread_xxx() calls.
|
172697 |
16-Oct-2007 |
alfred |
Get rid of qaddr_t.
Requested by: bde
|
172113 |
10-Sep-2007 |
bz |
Fix a DIV0 in case a large value for fs_avgfilesize or fs_avgfpdir is given (with newfs or tunefs) and dirsize overflows.
In case dirsize is <= 0 because of an overflow set maxcontigdirs to 0 so it will be 1 later. This is what would happen for large fs_avgfilesize. [1]
Identified with help from: roberto, pjd Submitted by: pjd [1] Approved by: re (rwatson) MFC after: 8 days
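
The guard described in this entry can be sketched roughly as follows. This is a minimal illustration and not the committed kernel code: the helper name pick_maxcontigdirs(), its parameters, and the cap of 255 are assumptions made for the example. The product is computed in unsigned arithmetic so that the overflow wraps in a well-defined way; a wrapped value that would be <= 0 as a signed int skips the division.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel function): derive how many
 * directories may be laid out contiguously from the average free
 * bytes and the two tunables fs_avgfilesize and fs_avgfpdir.
 */
static int
pick_maxcontigdirs(int64_t avgfreebytes, uint32_t avgfilesize,
    uint32_t avgfpdir)
{
	/* Truncate the 64-bit product to 32 bits: well-defined wraparound. */
	uint32_t wrapped = (uint32_t)((uint64_t)avgfilesize * avgfpdir);
	int maxcontigdirs;

	assert(avgfreebytes >= 0);
	if (wrapped == 0 || wrapped > (uint32_t)INT32_MAX) {
		/* dirsize would be <= 0 as a signed int: never divide by it. */
		maxcontigdirs = 0;
	} else {
		int64_t q = avgfreebytes / (int64_t)wrapped;

		/* The cap of 255 is an assumption of this sketch. */
		maxcontigdirs = q > 255 ? 255 : (int)q;
	}
	/* As the entry says, a value of 0 ends up as 1 later. */
	if (maxcontigdirs == 0)
		maxcontigdirs = 1;
	return (maxcontigdirs);
}
```

With both tunables set to 65536 the product is 2^32, which wraps to 0; the sketch then returns 1 instead of dividing by zero.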
|
171437 |
13-Jul-2007 |
rodrigc |
Perform range check before allocating memory when reading extended attributes.
Reviewed by: kib Approved by: re (hrs) PR: 114389
|
170991 |
22-Jun-2007 |
kib |
Fix livelock that could occur when snapshotting UFS with quotas, where some quota limit was exceeded. The sequence of UFS_VALLOC()/UFS_VFREE() calls there could cause the inodeblock to have both freefile and inodedep dependencies without any inode in the block being marked for write. Then, softdep_check_suspend() would return EAGAIN forever.
Force write of inodeblock with allocated freefile softdependency by setting IN_MODIFIED flag in softdep_freefile and unconditionally calling UFS_UPDATE() in ufs_reclaim.
Reported by: kris Debug help and tested by: Peter Holm Approved by: re (kensmith) MFC after: 3 weeks
|
170587 |
12-Jun-2007 |
rwatson |
Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in some cases, move to priv_check() if it was an operation on a thread and no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where a privilege check is performed and then used in many different deferred paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp Obtained from: TrustedBSD Project
|
170307 |
05-Jun-2007 |
jeff |
Commit 14/14 of sched_lock decomposition. - Use thread_lock() rather than sched_lock for per-thread scheduling synchronization. - Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization.
Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
|
170174 |
01-Jun-2007 |
jeff |
- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. Reads proceed without locks. - Aggregate an exiting thread's rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch(), to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to its exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits.
Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)
|
169671 |
18-May-2007 |
kib |
Since the renaming of vop_lock to _vop_lock, pre- and post-condition function calls are no longer generated for vop_lock. Rename _vop_lock to vop_lock1 to satisfy the tools/vnode_if.awk assumption about vop naming conventions. This restores pre/post-condition calls.
|
169239 |
03-May-2007 |
thompsa |
Add a newline to the printf message.
|
168576 |
10-Apr-2007 |
kib |
Fix the NAMEI zone leak when snapshot was successfully created.
Reported and tested by: Peter Holm MFC after: 2 weeks
|
168575 |
10-Apr-2007 |
kib |
Recalculate the NEWBLOCK flag for pagedep structure after the softdep lock is dropped, since pagedep may be already processed and deallocated.
Found and tested by: kris MFC after: 2 weeks
|
168574 |
10-Apr-2007 |
kib |
When LK_NOWAIT is passed as argument to process_worklist_item(), this does not prevent handle_workitem_remove() from recursing into a blocking version. Add the dirrem to worklist instead of processing it now if this is the case.
Reported and tested by: kris Submitted by: tegge MFC after: 2 weeks
|
168353 |
04-Apr-2007 |
delphij |
Use *_EMPTY macros when appropriate.
|
168021 |
29-Mar-2007 |
kib |
Revert rev. 1.205. Replace unconditional acquisition of Giant when QUOTAS are defined with a VFS_LOCK_GIANT(NULL) call. This shall fix softdep operation when mpsafe_vfs = 0.
Reported and tested by: kris Submitted by: tegge MFC after: 1 week
|
167737 |
20-Mar-2007 |
kib |
Mark UFS as being MP-Safe in the "options QUOTA" case too. Remove no longer necessary Giant acquisitions in the softdepend processing code.
Tested by: Peter Holm Reviewed by: tegge Approved by: re (kensmith)
|
167719 |
19-Mar-2007 |
brian |
When we write extended attributes, assert that the inode hasn't already been deleted. The assertion is important to show that we won't end up accounting for extended attribute blocks (using fs_pendingblocks) in our subsequent call to fs_alloc().
Agreed verbally by: mckusick
MFC after: 3 weeks
|
167543 |
14-Mar-2007 |
kib |
Implement fine-grained locking for UFS quotas.
Each struct dquot gets dq_lock mutex to protect dq_flags and to interlock with DQ_LOCK. qhash, dqfreelist and dq.dq_cnt are protected by global dqhlock mutex.
i_dquot array for inode is protected by lockmgr' vnode lock, corresponding assert added to the dqget(). Access to struct ufsmount quota-related fields (um_quotas and um_qflags) is protected by um_lock.
Tested by: Peter Holm Reviewed by: tegge Approved by: re (kensmith)
This work would not have been possible without the enormous amount of help given by Tor Egge and Peter Holm. Tor reviewed each version of the patch, pointed out numerous errors and provided invaluable suggestions. Peter did tireless testing of the patch as it was developed.
|
167497 |
13-Mar-2007 |
tegge |
Make insmntque() externally visible and allow it to fail (e.g. during late stages of unmount). On failure, the vnode is recycled.
Add insmntque1(), to allow for file system specific cleanup when recycling vnode on failure.
Change getnewvnode() to no longer call insmntque(). Previously, embryonic vnodes were put onto the list of vnodes belonging to a file system, which is unsafe for a file system marked MPSAFE.
Change vfs_hash_insert() to no longer lock the vnode. The caller now has that responsibility.
Change most file systems to lock the vnode and call insmntque() or insmntque1() after a new vnode has been sufficiently setup. Handle failed insmntque*() calls by propagating errors to callers, possibly after some file system specific cleanup.
Approved by: re (kensmith) Reviewed by: kib In collaboration with: kib
|
167155 |
01-Mar-2007 |
pjd |
Fix build breakage.
|
167152 |
01-Mar-2007 |
pjd |
Rename PRIV_VFS_CLEARSUGID to PRIV_VFS_RETAINSUGID, which seems to better describe the privilege.
OK'ed by: rwatson
|
167151 |
01-Mar-2007 |
pjd |
Avoid checking for privileges if there is no need to.
Discussed with: rwatson
|
166924 |
23-Feb-2007 |
brian |
Account for di_blocks allocations when IN_SPACECOUNTED is set in an inode's i_flag.
It's possible that after ufs_inactive() calls softdep_releasefile(), i_nlink stays >0 for a considerable amount of time (> 60 seconds here). During this period, any ffs allocation routines that alter di_blocks must also account for the blocks in the filesystem's fs_pendingblocks value.
This change fixes an eventual df/du discrepancy that will happen as the result of fs_pendingblocks being reduced to <0.
The only manifestation of this that people may recognise is the following message on boot:
/somefs: update error: blocks -N files M
at which point the negative pending block count is adjusted to zero.
Reviewed by: tegge MFC after: 3 weeks
|
166864 |
21-Feb-2007 |
mckusick |
The functions that set and delete external attributes must check that the filesystem is not mounted read-only before proceeding.
Reported by: Ryan Beasley <ryanb@FreeBSD.org> MFC after: 1 week
|
166799 |
17-Feb-2007 |
mckusick |
This README file is obsolete. The cited problems were fixed long ago and the code is installed by default so no longer requires action by the administrator to be included.
|
166774 |
15-Feb-2007 |
pjd |
Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method. This way we may support multiple structures in v_data vnode field within one file system without using black magic.
Vnode-to-file-handle should be VOP in the first place, but was made VFS operation to keep interface as compatible as possible with SUN's VFS. BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.
VFS_VPTOFH() was left for API backward compatibility, but is marked for removal before 8.0-RELEASE.
Approved by: mckusick Discussed with: many (on IRC) Tested with: ufs, msdosfs, cd9660, nullfs and zfs
|
166506 |
04-Feb-2007 |
tegge |
Call pbgetvp() and pbrelvp() instead of setting b_vp directly.
PR: kern/108151
|
166193 |
23-Jan-2007 |
kib |
Cylinder group bitmaps and blocks containing inode for a snapshot file are after snaplock, while other ffs device buffers are before snaplock in global lock order. By itself, this could cause deadlock when bdwrite() tries to flush dirty buffers on snapshotted ffs. If, during the flush, COW activity for snapshot needs to allocate block and ffs_alloccg() selects the cylinder group that is being written by bdwrite(), then kernel would panic due to recursive buffer lock acquisition.
Avoid dealing with buffers in bdwrite() that are from the other side of the snaplock divisor in the lock order than the buffer being written. Add new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in the bdwrite(). Default implementation, bufbdflush(), refactors the code from bdwrite(). For ffs device buffers, specialized implementation is used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes) Tested by: Peter Holm X-MFC after: 3 weeks (if ever: it changes ABI)
|
166142 |
20-Jan-2007 |
mpp |
Quota system cleanup.
1) Do not do quota accounting for the actual quota data files or for file system snapshot files ("system" files). This prevents a deadlock described in PR kern/30958 if the kernel ever has to grow the quota file. Snapshot files were already exempt from the quota checks, but this change generalized the check.
2) Fix a cast that caused extremely large uids/gids to incorrectly write the quota information to the data file at a truncated value for a uint_t32 id value. The incorrect cast caused quota files in this case to be around 4GB in size, with the correct cast they can now be 131GB in size. Also related to PR kern/30958.
3) Check for what appear to be negative UIDs/GIDs and not account for them. This prevents the quota files from becoming 131GB in size and causing quotacheck to run forever at bootup. This could also cause the kernel to try and expand the quota file, which might deadlock due to the issue in #1. kern/30958 and kern/38156 (and some much older closed PR's).
4) With the deadlock problems gone, the kernel can now expand the size of the quota database files if it needs to.
5) Pass in the i-node count change value to chkiq and chkiqchg as an int, like it used to be before the common routine was split up into 2 different routines to increase / decrease the i-node in-use count. Prevents an underflow on the i-node count. Related to PR kern/89247.
6) Prevent the block usage from growing slowly if a file system is full and the write was denied due to that fact. PR kern/89247.
Some of these changes require an updated quotacheck to prevent the creation of huge (131GB) quota data files (item #3).
#1/#4 probably fixes a lot of the random hangs when quotas are enabled, possibly some of the jail hangs.
|
166051 |
16-Jan-2007 |
mpp |
Fix a spelling error in some comments. heirarchy -> hierarchy.
Obtained from: OpenBSD
|
164248 |
13-Nov-2006 |
kmacy |
Change vop_lock handling to allow tracking of callers' file and line for acquisition of lockmgr locks
Approved by: scottl (standing in for mentor rwatson)
|
164033 |
06-Nov-2006 |
rwatson |
Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking.
Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
|
163874 |
01-Nov-2006 |
kib |
Acquire Giant in the softdep_flush for clear_remove() and clear_inodedeps() processing when QUOTA is set.
Reported and tested by: Peter Holm Reviewed by: tegge MFC after: 3 days
|
163841 |
31-Oct-2006 |
pjd |
Add gjournal specific code to the UFS file system: - Add FS_GJOURNAL flag which enables gjournal support on a file system. - Add cg_unrefs field to the cylinder group structure which holds number of unreferenced (orphaned) inodes in the given cylinder group. - Add fs_unrefs field to the super block structure which holds total number of unreferenced (orphaned) inodes. - When a file or a directory is orphaned (last reference is removed, but object is still open), increase the fs_unrefs and cg_unrefs fields, which hint to fsck which cylinder groups to search for such (orphaned) objects. - When file is last closed, decrease {fs,cg}_unrefs fields. - Add VV_DELETED vnode flag which points at orphaned objects.
Sponsored by: home.pl
|
163606 |
22-Oct-2006 |
rwatson |
Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project Sponsored by: SPARTA
|
163194 |
10-Oct-2006 |
kib |
Do not translate the IN_ACCESS inode flag into the IN_MODIFIED while filesystem is suspending/suspended. Doing so may result in deadlock. Instead, set the (new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted.
Change the locking protocol in order to set the IN_ACCESS and timestamps without upgrading shared vnode lock to exclusive (see comments in the inode.h). Before that, inode was modified while holding only shared lock.
Tested by: Peter Holm Reviewed by: tegge, bde Approved by: pjd (mentor) MFC after: 3 weeks
|
162654 |
26-Sep-2006 |
tegge |
Protect change to bo_flag by holding the bufobj mutex.
|
162653 |
26-Sep-2006 |
tegge |
Reduce fluctuations of mnt_flag to allow unlocked readers to get a slightly more consistent view.
|
162652 |
26-Sep-2006 |
tegge |
Don't restore MNT_QUOTA bit in mnt_flag after snapshot creation, closing a race between nmount() and quotactl().
|
162650 |
26-Sep-2006 |
tegge |
Increase mnt_noasync once in softdep_mount() to disallow async io, closing a window where a file system using softupdates could be async for a short while if both MNT_UPDATE and MNT_ASYNC were passed as flags to nmount(). Add MNTK_SOFTDEP flag to ensure that softdep_mount() doesn't increase mnt_noasync multiple times.
|
162647 |
26-Sep-2006 |
tegge |
Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag. This eliminates a race where MNT_UPDATE flag could be lost when nmount() raced against sync(), sync_fsync() or quotactl().
|
162460 |
20-Sep-2006 |
kib |
Fix the glitch introduced in rev. 1.93. In softdep_sync_metadata(), the switch on worklist type contains two for() loops, for D_INDIRDEP and D_PAGEDEP. On error, these loops are exited by break, where the switch itself should be left. Use goto instead of break to reach the error handling code.
Reported by: Peter Holm Reviewed by: tegge Approved by: pjd (mentor) MFC after: 2 weeks
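
The C pitfall behind this entry is that a break inside a for() loop nested in a switch only leaves the loop, not the switch or the enclosing walk. A small sketch of the goto-based pattern, assuming a made-up worklist (struct item, process_items() and the negative-value "error" convention are all hypothetical, not taken from ffs_softdep.c):

```c
#include <assert.h>
#include <stddef.h>

enum item_type { ITEM_SIMPLE, ITEM_LIST };

struct item {
	enum item_type type;
	const int *vals;	/* a negative value stands in for an error */
	size_t nvals;
};

/*
 * Hypothetical worklist walk.  Returns 0 on success or the first
 * error found; *visited counts the items whose handling was started.
 */
static int
process_items(const struct item *items, size_t n, size_t *visited)
{
	int error = 0;
	size_t i, j;

	assert(items != NULL || n == 0);
	*visited = 0;
	for (i = 0; i < n; i++) {
		(*visited)++;
		switch (items[i].type) {
		case ITEM_LIST:
			for (j = 0; j < items[i].nvals; j++) {
				if (items[i].vals[j] < 0) {
					error = items[i].vals[j];
					/* "break" would only leave the inner for loop. */
					goto out;
				}
			}
			break;
		case ITEM_SIMPLE:
			break;
		}
	}
out:
	/* Error handling and cleanup would go here. */
	return (error);
}
```

On the first failing element the walk stops immediately, so later items are never touched with an error pending.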
|
161515 |
21-Aug-2006 |
kib |
While checking for update of snapshot file in the ffs_copyonwrite, first filter out metadata update. Otherwise, devfs vnode could be erroneously interpreted as ufs one, causing further check of i_flags to use random memory.
PR: kern/100365 Debugged and fix described by: tegge Approved by: pjd (mentor) MFC after: 2 weeks
|
160462 |
18-Jul-2006 |
stefanf |
Drop two unnecessary casts.
|
160205 |
09-Jul-2006 |
pjd |
Declare UFS module version.
|
160204 |
09-Jul-2006 |
pjd |
Change fs->fs_fsmnt to mp->mnt_stat.f_mntonname in warnings about missing MAC and ACLs support in the kernel. If it is a first mount, fs->fs_fsmnt is empty.
MFC after: 1 week
|
159209 |
03-Jun-2006 |
rodrigc |
Check the sectorsize of the underlying disk before trying to bread() the UFS superblock. Should eliminate crashes when trying to do: mount -t ufs on an audio CD.
PR: kern/85893 Reported by: Russell Francis <rfrancis at ev dot net> MFC after: 1 week
|
158952 |
26-May-2006 |
rodrigc |
Remove "update" from ffs_opts. It has been moved to global_opts in vfs_mount.c.
|
158924 |
26-May-2006 |
rodrigc |
Remove calls to vfs_export() for exporting a filesystem for NFS mounting from individual filesystems. Call it instead in vfs_mount.c, after we call VFS_MOUNT() for a specific filesystem.
|
158867 |
24-May-2006 |
rodrigc |
Take errmsg out of ffs_opts. It is already part of global_opts in vfs_mount.c.
|
158659 |
16-May-2006 |
trhodes |
Provide a less cryptic panic message in place of just "found inode."
|
158636 |
16-May-2006 |
tegge |
Read block hints list from last snapshot on the active snapshot list.
|
158634 |
15-May-2006 |
tegge |
Copy last block on file system again after file system has been suspended.
Obtained from: NetBSD
|
158633 |
15-May-2006 |
tegge |
Don't leak a locked buffer if last block on file system cannot be read.
|
158632 |
15-May-2006 |
tegge |
Errors detected while file system is suspended should not trigger an assertion failure.
|
158527 |
13-May-2006 |
tegge |
Expunge traces of unlinked snapshot files when making a new snapshot.
|
158338 |
06-May-2006 |
tegge |
ffs_syncvnode() might skip some of the blocks due to them being locked, assuming them to be inflight write buffers. This is not always the case. bufdaemon might hold the buffer lock and give up writing the buffer due to it having dependencies, the file system being suspended or the vnode lock being held by another thread. When bufdaemon decides to write the buffer there is still a window before bufobj_wref() has been called, allowing other threads to believe that the vnode has no dirty buffers or inflight writes.
Try harder to flush first block of new subdirectory to get rid of MKDIR_BODY dependency.
|
158325 |
05-May-2006 |
tegge |
Return error if vnode was reclaimed while it was temporarily unlocked. Add missing calls to vn_finished_write() in error handling.
|
158322 |
05-May-2006 |
tegge |
Turn off disk quotas for snapshot files.
|
158321 |
05-May-2006 |
tegge |
Avoid locking overhead when snapshots are disabled.
|
158308 |
05-May-2006 |
pjd |
- Set bio_done directly to NULL to indicate that we want to wait for the bio. - Use biowait() instead of copying the code.
MFC after: 1 month
|
158262 |
03-May-2006 |
tegge |
Detect the snapshot file being prematurely unlinked.
|
158261 |
03-May-2006 |
tegge |
Temporarily undo the cluster's contribution to global runningbufspace while handling copy on write for the buffers taking part in the cluster.
|
158260 |
03-May-2006 |
tegge |
A side effect of calling runningbufwakeup() is that bp->b_runningbufspace is cleared. Save old value and restore bp->b_runningbufspace before returning from ffs_copyonwrite().
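
A rough sketch of the save-and-restore pattern this entry describes, using stand-in types rather than the kernel's struct buf. The names buf_sketch, wakeup_sketch() and copyonwrite_sketch() are hypothetical, and re-adding the saved amount to the global counter is an assumption of this sketch, not something the log message states:

```c
#include <assert.h>

struct buf_sketch {
	long b_runningbufspace;
};

/* Stand-in for the global running buffer space accounting. */
static long runningbufspace_total;

/* Stand-in for runningbufwakeup(): un-account the buffer and clear its field. */
static void
wakeup_sketch(struct buf_sketch *bp)
{
	runningbufspace_total -= bp->b_runningbufspace;
	bp->b_runningbufspace = 0;
}

static void
copyonwrite_sketch(struct buf_sketch *bp)
{
	/* Save the value first: the wakeup call below clears it. */
	long saved = bp->b_runningbufspace;

	assert(saved >= 0);
	if (saved != 0)
		wakeup_sketch(bp);

	/* ... copy-on-write handling would happen here ... */

	if (saved != 0) {
		/* Restore the field (and, by assumption, the global count). */
		bp->b_runningbufspace = saved;
		runningbufspace_total += saved;
	}
}
```

After the call the buffer carries the same b_runningbufspace it had on entry, so later completion code un-accounts the right amount.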
|
158259 |
02-May-2006 |
tegge |
Close a race when VOP_LOCK() on a snapshot file is attempted at the same time as it is changed back into a normal file. The locker would get the shared "snaplk" lock which would no longer be the correct lock for the vnode.
|
158100 |
28-Apr-2006 |
scottl |
Fix a typo.
|
158095 |
28-Apr-2006 |
jeff |
- Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child buffers to go on the buf daemon's DIRTYGIANT queue. - Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler runs in the context of the buf daemon and may require Giant.
|
157955 |
22-Apr-2006 |
trhodes |
Revert previous to this file before an actual request is made.
|
157919 |
21-Apr-2006 |
trhodes |
Remove what I believe are two useless ifdefs. If a user or administrator enables multilabel, or any option for that matter, most likely they have a reason. This will allow users to see that multilabel is enabled via an issued "mount" command and remove an annoying warning - printed only when a MAC kernel is not installed - on boot up.
Discussed with: green, brueffer, Samy Al Bahra. Probably ran past: csjp (though I can't remember).
|
157805 |
17-Apr-2006 |
kensmith |
Fix panic() message to give the right function name.
|
157447 |
03-Apr-2006 |
tegge |
Eliminate softdep_flush() livelock by accounting for number of worklist items marked as being in progress.
|
157325 |
31-Mar-2006 |
jeff |
- Release the references acquired by VOP_GETWRITEMOUNT and vfs_getvfs().
Discussed with: tegge Tested by: kris Sponsored by: Isilon Systems, Inc.
|
156899 |
19-Mar-2006 |
tegge |
Allow compilation when not using softupdates.
|
156898 |
19-Mar-2006 |
tegge |
Let snapshots make a copy of old contents for all buffers taking part in a cluster instead of just the first buffer.
Delay buf_start() calls until snapshots have a copy of old content.
PR: kern/93942
|
156896 |
19-Mar-2006 |
tegge |
Reduce probability of unmount failing after having unmounted snapshots.
|
156895 |
19-Mar-2006 |
tegge |
Ensure that vnode for directory isn't reclaimed before ffs_snapshot() has completed expunging unlinked files. It could come back at another memory location causing a lock order reversal.
|
156589 |
12-Mar-2006 |
jeff |
- Remove the call to softdep_waitidle after suspending the filesystem. This does not do what I wanted as all dirty buffers must be flushed by the call to ffs_sync and any remaining dependency work would mean that this failed.
Pointed out by: tegge
|
156587 |
12-Mar-2006 |
jeff |
- Remove the call to softdep_waitidle after suspending the filesystem. This does not do what I wanted as all dirty buffers must be flushed by the call to ffs_sync and any remaining dependency work would mean that this failed.
Pointed out by: tegge
|
156560 |
11-Mar-2006 |
tegge |
Block secondary writes while expunging active unlinked files.
Fix detection of active unlinked files by checking VI_OWEINACT and VI_DOINGINACT in addition to v_usecount.
Defer inactive handling for unlinked files if the file system is mostly suspended (secondary writes being blocked).
Perform deferred inactive handling after the file system is resumed.
|
156521 |
10-Mar-2006 |
tegge |
Remove unneeded (and broken) usage of MNT_REF()/MNT_REL().
|
156451 |
08-Mar-2006 |
tegge |
Use vn_start_secondary_write() and vn_finished_secondary_write() as a replacement for vn_write_suspend_wait() to better account for secondary write processing.
Close race where secondary writes could be started after ffs_sync() returned but before the file system was marked as suspended.
Detect if secondary writes or softdep processing occurred during vnode sync loop in ffs_sync() and retry the loop if needed.
|
156225 |
02-Mar-2006 |
tegge |
Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must be called without any vnode locks held. Remove calls to vn_start_write() and vn_finished_write() in vnode_pager_putpages() and add these calls before the vnode lock is obtained to most of the callers that don't already have them.
|
156206 |
02-Mar-2006 |
jeff |
- Acquire lk in softdep_slowdown so that it's owned when we call softdep_speedup(). - Assert that lk is held in softdep_speedup() rather than acquiring it. This avoids a potential lock recursion.
|
156203 |
02-Mar-2006 |
jeff |
- Move softdep from using a global worklist to per-mount worklists. This has many positive effects including improved smp locking, reducing interdependencies between mounts that can lead to deadlocks, etc. - Add the softdep worklist and various counters to the ufsmnt structure. - Add a mount pointer to the workitem and remove mount pointers from the various structures derived from the workitem as they are now redundant. - Remove the poor-man's semaphore protecting softdep_process_worklist and softdep_flushworklist. Several threads may now process the list simultaneously. - Add softdep_waitidle() to block the thread until all pending dependencies being operated on by other threads have been flushed. - Use softdep_waitidle() in unmount and snapshots to block either operation until the fs is stable. - Remove softdep worklist processing from the syncer and move it into the softdep_flush() thread. This thread processes all softdep mounts once each second and when it is called via the new softdep_speedup() when there is a resource shortage. This removes the softdep hook from the kernel and various hacks in header files to support it.
Reviewed by/Discussed with: tegge, truckman, mckusick Tested by: kris
|
154152 |
09-Jan-2006 |
tegge |
Add marker vnodes to ensure that all vnodes associated with the mount point are iterated over when using MNT_VNODE_FOREACH.
Reviewed by: truckman
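A minimal sketch of the marker technique in userland C, not the kernel's MNT_VNODE_FOREACH code: the iterator parks a dummy node after the element being visited, so the callback may unlink that element without the walk losing its place. All names here (struct node, foreach_with_marker, drop_even) are hypothetical, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *prev, *next;
    int is_marker;               /* marker nodes are skipped, never visited */
    int value;
};

struct list { struct node head; };   /* circular list with a sentinel head */

static void list_init(struct list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->head.is_marker = 1;
}

static void insert_after(struct node *pos, struct node *n)
{
    n->prev = pos;
    n->next = pos->next;
    pos->next->prev = n;
    pos->next = n;
}

static void unlink_node(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Visit every real node; fn may unlink the node it is handed. */
int foreach_with_marker(struct list *l, void (*fn)(struct node *))
{
    struct node marker = { .is_marker = 1 };
    struct node *n;
    int visited = 0;

    insert_after(&l->head, &marker);
    while ((n = marker.next) != &l->head) {
        /* Move the marker past n before calling out. */
        unlink_node(&marker);
        insert_after(n, &marker);
        if (n->is_marker)
            continue;            /* another iterator's marker */
        fn(n);
        visited++;
    }
    unlink_node(&marker);
    return visited;
}

static void drop_even(struct node *n)
{
    if (n->value % 2 == 0)
        unlink_node(n);
}

int marker_demo(void)
{
    struct list l;
    struct node nodes[5];
    int remaining = 0;

    list_init(&l);
    for (int i = 0; i < 5; i++) {
        nodes[i].is_marker = 0;
        nodes[i].value = i + 1;
        insert_after(l.head.prev, &nodes[i]);    /* append at tail */
    }
    int visited = foreach_with_marker(&l, drop_even);
    for (struct node *n = l.head.next; n != &l.head; n = n->next)
        remaining++;
    assert(remaining == 3);      /* 1, 3 and 5 survive */
    return visited;
}
```

Calling marker_demo() visits all five nodes even though two of them are removed during the walk.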
|
154150 |
09-Jan-2006 |
tegge |
If the lock passed to getdirtybuf() is the softdep lock then the background write completed wakeup could be missed. Close the race by grabbing the lock normally used for protection of bp->b_xflags.
Reviewed by: truckman
|
154149 |
09-Jan-2006 |
tegge |
Broaden scope of softdep_worklist_busy rwlock protection of softdep processing to avoid some dependencies being missed by softdep_flushworklist().
Reviewed by: truckman
|
154065 |
06-Jan-2006 |
imp |
New option: NO_FFS_SNAPSHOT. I did this in p4 about the same time that NetBSD implemented it independently of them (don't know which one was actually first). This saves about 24k for those times you don't need snapshot support (like when running off a ram disk, or in an embedded environment where size matters).
|
153689 |
23-Dec-2005 |
delphij |
Typo.
|
152771 |
24-Nov-2005 |
rodrigc |
Fix parsing of atime, clusterr, clusterw, exec, suid, symfollow mount options.
Noticed by: Amir Shalem <amir at boom dot org dot il>
|
152639 |
20-Nov-2005 |
rodrigc |
If export mount flag is not passed in, set default parameters for export structure and pass that to vfs_export(). Currently in userland mount(8), an export structure is unconditionally passed in, only for UFS. This is an attempt to move that UFS-specific behavior out of mount(8) and into the UFS filesystem code.
|
152622 |
19-Nov-2005 |
rodrigc |
Add more options to ffs_opts, so that vfs_filteropts() will not complain when we pass these options to a UFS filesystem as strings via nmount(): noexec, nosuid, nosymfollow, sync, suiddir
|
152567 |
18-Nov-2005 |
rodrigc |
- Add parsing for the following existing UFS/FFS mount options in the nmount() callpath via vfs_getopt(), and set the appropriate MNT_* flag: -> acls, async, force, multilabel, noasync, noatime, -> noclusterr, noclusterw, snapshot, update
- Allow errmsg as a valid mount option via vfs_getopt(), so we can later add a hook to propagate mount errors back to userspace via vfs_mount_error().
|
151906 |
31-Oct-2005 |
ps |
Rate limit filesystem full and out of inodes messages to once a second.
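A minimal userland sketch of once-per-second message limiting, assuming a caller-supplied timestamp; it is not the kernel code from this commit, and the names msg_rate and msg_allowed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

struct msg_rate {
    time_t last;     /* time the message was last emitted */
    int primed;      /* 0 until the first message has been emitted */
};

/* Return 1 if a message may be emitted at time now, else 0. */
int msg_allowed(struct msg_rate *r, time_t now)
{
    assert(r != NULL);
    if (r->primed && now - r->last < 1)
        return 0;                /* already emitted during this second */
    r->last = now;
    r->primed = 1;
    return 1;
}

int ratelimit_demo(void)
{
    struct msg_rate r = { 0, 0 };
    time_t stamps[] = { 100, 100, 100, 101, 101, 103 };
    int printed = 0;

    for (size_t i = 0; i < sizeof(stamps) / sizeof(stamps[0]); i++)
        printed += msg_allowed(&r, stamps[i]);
    return printed;
}
```

In the demo, six attempts spread over timestamps 100, 101 and 103 result in three emitted messages.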
|
151528 |
21-Oct-2005 |
njl |
Adjust maxfilesize for UFS1 and old 4.4 FFS. For UFS1, increase the limit to (max block - 1) * bsize. For DEV_BSIZE, this doubles the limit from 0.5 TB to 1 TB. For the old 4.4 FFS case, decrease the limit from 0.5 TB to 2 GB - 1. Older systems had a 32 bit off_t so they couldn't access the larger files anyway.
Collaboration with: bde
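A worked check of the sizes quoted above, as a hedged sketch: the block counts 2^31 and 2^30 are assumptions chosen to reproduce the stated 1 TB and 0.5 TB figures with a 512-byte DEV_BSIZE, and the exact off-by-one convention used in the kernel is not taken from this log. The function name limit_bytes is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size limit in bytes for nblocks blocks of bsize bytes each. */
uint64_t limit_bytes(uint64_t nblocks, uint64_t bsize)
{
    assert(bsize != 0 && nblocks <= UINT64_MAX / bsize);  /* no overflow */
    return nblocks * bsize;
}

int maxfilesize_demo(void)
{
    const uint64_t dev_bsize = 512;                        /* DEV_BSIZE */
    const uint64_t tib = (uint64_t)1 << 40;

    /* Assumed new UFS1 limit: 2^31 blocks -> 1 TB. */
    assert(limit_bytes((uint64_t)1 << 31, dev_bsize) == tib);
    /* Assumed old limit: 2^30 blocks -> 0.5 TB. */
    assert(limit_bytes((uint64_t)1 << 30, dev_bsize) == tib / 2);
    /* Old 4.4 FFS: a signed 32-bit off_t tops out at 2 GB - 1. */
    assert((uint64_t)INT32_MAX == ((uint64_t)1 << 31) - 1);
    return 1;
}
```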
|
151218 |
10-Oct-2005 |
tegge |
Avoid unintended VMIO on directories and symlinks due to leftover object not having been destroyed.
|
151184 |
09-Oct-2005 |
tegge |
Adjust totread argument passed to cluster_read() to account for offset not being block aligned.
|
151181 |
09-Oct-2005 |
tegge |
Don't pretend that a failed sync write was successful.
|
151180 |
09-Oct-2005 |
tegge |
Reduce probability for a deadlock that can occur when a snapshot inode is updated by a process holding the snapshot lock. Another process updating a different inode in the same inodeblock will do copy on write checks and lock in the opposite direction.
The snapshot code forces a copy on write of these blocks manually (cf. start of expunge_ufs[12]) and these inode blocks are later put on snapblklist.
This partial fix is to 'drain' the relevant ffs_copyonwrite() operation after installing new snapblklist. This is not a 100% solution since a failed block allocation can cause implicit fsync() which might deadlock before the new snapblklist has been installed.
|
151179 |
09-Oct-2005 |
tegge |
Eliminate a deadlock that can occur when a dirty block belonging to a snapshot file is flushed by a process not holding snaplk (e.g. bufdaemon). Another process might hold snaplk and try to access the block due to ffs_copyonwrite processing.
|
151178 |
09-Oct-2005 |
tegge |
Eliminate a deadlock that can occur during the cgaccount() processing due to the cg map buffer being held when writing indirect blocks. The process ends up in ffs_copyonwrite(), attempting to get snaplk while holding the cg map buffer lock.
Another process might be in ffs_copyonwrite(), trying to allocate a new block for a copy. It would hold snaplk while trying to get the cg map buffer lock.
Release the cg map buffer early and use the copy for most of the cgaccount processing to avoid this deadlock.
|
151177 |
09-Oct-2005 |
tegge |
Reduce the probability of low block numbers passed to ffs_snapblkfree() by skipping the call from ffs_snapremove() if the block number is zero.
Simplify snapshot locking in ffs_copyonwrite() and ffs_snapblkfree() by using the same locking protocol for low block numbers as for larger block numbers. This removes a lock leak that could happen if vn_lock() succeeded after lockmgr() failed in ffs_snapblkfree().
Check if snapshot is gone before retrying a lock in ffs_copyonwrite().
|
151176 |
09-Oct-2005 |
tegge |
Reinitialize v_type and v_op fields in case vnode has been reused without reclamation. If the vnode previously was a fifo then v_op would point to ffs_fifoops[12] instead of the expected ffs_vnodeops[12], causing a panic at the end of ffsext_strategy.
|
150891 |
03-Oct-2005 |
truckman |
Initialize the inode i_flag field in ffs_valloc() to clean up any stale flag bits left over from before the inode was recycled.
Without this change, a leftover IN_SPACECOUNTED flag could prevent softdep_freefile() and softdep_releasefile() from incrementing fs_pendinginodes. Because handle_workitem_freefile() unconditionally decrements fs_pendinginodes, a negative value could be reported at file system unmount time with a message like:
unmount pending error: blocks 0 files -3
The pending block count in fs_pendingblocks could also be negative for similar reasons. These errors can cause the data returned by statfs() to be slightly incorrect. Some other cleanup code in softdep_releasefile() could also be incorrectly bypassed.
MFC after: 3 days
|
150791 |
01-Oct-2005 |
truckman |
Correct previous commit to fix the sense of the TDP_NORUNNINGBUF check in ffs_copyonwrite() that is a precondition for calling waitrunningbufspace().
Pointed out by: tegge Pointy hat to: truckman MFC after: 3 days
|
150760 |
30-Sep-2005 |
truckman |
Un-staticize waitrunningbufspace() and call it before returning from ffs_copyonwrite() if any async writes were launched.
Restore the thread's previous TDP_NORUNNINGBUF state before returning from ffs_copyonwrite().
|
150741 |
30-Sep-2005 |
truckman |
Un-staticize runningbufwakeup() and staticize updateproc.
Add a new private thread flag to indicate that the thread should not sleep if runningbufspace is too large.
Set this flag on the bufdaemon and syncer threads so that they skip the waitrunningbufspace() call in bufwrite() rather than checking the proc pointer vs. the known proc pointers for these two threads. A way of preventing these threads from being starved for I/O but still placing limits on their outstanding I/O would be desirable.
Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from blocking on the runningbufspace check while holding snaplk. This prevents snaplk from being held for an arbitrarily long period of time if runningbufspace is high and greatly reduces the contention for snaplk. The disadvantage is that ffs_copyonwrite() can start a large amount of I/O if there are a large number of snapshots, which could cause a deadlock in other parts of the code.
Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace before attempting to grab snaplk so that I/O requests waiting on snaplk are not counted in runningbufspace as being in-progress. Increment runningbufspace again before actually launching the original I/O request.
Prior to the above two changes, the system could deadlock if enough I/O requests were blocked by snaplk to prevent runningbufspace from falling below lorunningspace and one of the bawrite() calls in ffs_copyonwrite() blocked in waitrunningbufspace() while holding snaplk.
See <http://www.holm.cc/stress/log/cons143.html>
|
150733 |
29-Sep-2005 |
truckman |
After a rmdir()ed directory has been truncated, force an update of the directory's inode after queuing the dirrem that will decrement the parent directory's link count. This will force the update of the parent directory's actual link count to actually be scheduled. Without this change the parent directory's actual link count would not be updated until ufs_inactive() cleared the inode of the newly removed directory, which might be deferred indefinitely. ufs_inactive() will not be called as long as any process holds a reference to the removed directory, and ufs_inactive() will not clear the inode if the link count is non-zero, which could be the result of an earlier system crash.
If a background fsck is run before the update of the parent directory's actual link count has been performed, or at least scheduled by putting the dirrem on the leaf directory's inodedep id_bufwait list, fsck will corrupt the file system by decrementing the parent directory's effective link count, which was previously correct because it already took the removal of the leaf directory into account, and setting the actual link count to the same value as the effective link count after the dangling, removed, leaf directory has been removed. This happens because fsck acts based on the actual link count, which will be too high when fsck creates the file system snapshot that it references.
This change has the fortunate side effect of more quickly cleaning up the large number of dirrem structures that linger for an extended time after the removal of a large directory tree. It also fixes a potential problem with the shutdown of the syncer thread timing out if the system is rebooted immediately after removing a large directory tree.
Submitted by: tegge MFC after: 3 days
|
150663 |
28-Sep-2005 |
rwatson |
Back out alpha/alpha/trap.c:1.124, osf1_ioctl.c:1.14, osf1_misc.c:1.57, osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60, svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81, svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55, svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10, ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58, unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133:
Now that Giant is acquired in uprintf() and tprintf(), the caller no longer needs to acquire Giant unless it also holds another mutex that would generate a lock order reversal when calling into these functions. Specifically not backed out is the acquisition of Giant in nfs_socket.c and rpcclnt.c, where local mutexes are held and would otherwise violate the lock order with Giant.
This aligns this code more with the eventual locking of ttys.
Suggested by: bde
|
150335 |
19-Sep-2005 |
rwatson |
Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(), as they both interact with the tty code (!MPSAFE) and may sleep if the tty buffer is full (per comment).
Modify all consumers of uprintf() and tprintf() to hold Giant around calls into these functions. In most cases, this means adding an acquisition of Giant immediately around the function. In some cases (nfs_timer()), it means acquiring Giant higher up in the callout.
With these changes, UFS no longer panics on SMP when either blocks are exhausted or inodes are exhausted under load due to races in the tty code when running without Giant.
NB: Some reduction in calls to uprintf() in the svr4 code is probably desirable.
NB: In the case of nfs_timer(), calling uprintf() while holding a mutex, or even in a callout at all, is a bad idea, and will generate warnings and potential upset. This needs to be fixed, but was a problem before this change.
NB: uprintf()/tprintf() sleeping is generally a bad idea, as is having non-MPSAFE tty code.
MFC after: 1 week
|
150010 |
12-Sep-2005 |
tegge |
Giant is no longer needed here.
|
149808 |
05-Sep-2005 |
tegge |
Retain generation count when writing zeroes instead of an inode to disk.
Don't free a struct inodedep if another process is allocating saved inode memory for the same struct inodedep in initiate_write_inodeblock_ufs[12]().
Handle disappearing dependencies in softdep_disk_io_initiation().
Reviewed by: mckusick
|
149713 |
02-Sep-2005 |
ssouhlal |
ffs_mountfs() needs devvp to be locked, so lock it.
Glanced at by: phk Tested by: pjd MFC after: 3 days
|
149358 |
21-Aug-2005 |
ssouhlal |
Set the mountpoint path in the superblock (fs_fsmnt) at mount-time so that it appears in the various messages (not cleanly unmounted, filesystem full, etc). This has been broken since rev 1.261.
|
149354 |
21-Aug-2005 |
tegge |
Don't set the COMPLETE flag in an inodedep structure before the related inode has been written.
|
148608 |
31-Jul-2005 |
ups |
Delay freeing disk space for file system blocks until all dirty buffers are safely released. This fixes softdep problems on truncation (deletion) of files with dirty buffers.
Reviewed by: jeff@, mckusick@, ps@, tegge@ Tested by: glebius@, ps@ MFC after: 3 weeks
|
148200 |
20-Jul-2005 |
alc |
Eliminate inconsistency in the setting of the B_DONE flag. Specifically, make the b_iodone callback responsible for setting it if it is needed. Previously, it was set unconditionally by bufdone() without holding whichever lock is shared by the b_iodone callback and the corresponding top-half function. Consequently, in a race, the top-half function could conclude that operation was done before the b_iodone callback finished. See, for example, aio_physwakeup() and aio_fphysio().
Note: I don't believe that the other, more widely-used b_iodone callbacks are affected.
Discussed with: jeff Reviewed by: phk MFC after: 2 weeks
|
147198 |
09-Jun-2005 |
ssouhlal |
Allow EVFILT_VNODE events to work on every filesystem type, not just UFS by: - Making the pre and post hooks for the VOP functions work even when DEBUG_VFS_LOCKS is not defined. - Moving the KNOTE activations into the corresponding VOP hooks. - Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct mount that permits filesystems to disable the new behavior. - Creating a default VOP_KQFILTER function: vfs_kqfilter()
My benchmarks have not revealed any performance degradation.
Reviewed by: jeff, bde Approved by: rwatson, jmg (kqueue changes), grehan (mentor)
|
146802 |
30-May-2005 |
jeff |
- Don't set our bio op to be a READ when we've just completed a write. There are subtle differences in the read and write completion path. Instead, grab an extra write ref so the write path can drop it when we recursively call bufdone(). I believe this may be the source of the wrong bufobj panics.
Reported by: pho, kkenn
|
145824 |
03-May-2005 |
jeff |
- Don't restrict the softdep stats to DEBUG kernels, they cost nothing to export. This was happening anyway since this file manually sets DEBUG. - Add a sysctl for the number of items on the worklist. - Use a more canonical loop restart in softdep_fsync_mountdev, it saves some code at the expense of a goto and makes me worry less about modifying a variable that should be private to the TAILQ_FOREACH_SAFE macro.
|
145702 |
30-Apr-2005 |
jeff |
- Use bdone() directly instead of calling it indirectly through ffs_rawreaddone().
Sponsored by: Isilon Systems, Inc.
|
144659 |
05-Apr-2005 |
jeff |
- Consistently call 'vp' vp rather than ovp sometimes in ffs_truncate(). Do the same for oip.
Pointed out by: glebius
|
144590 |
03-Apr-2005 |
jeff |
- Use M_ZERO rather than explicitly calling bzero(). - Don't intermingle direct calls to lockmgr and indirect calls through VOPs. This will be important in the future. - Don't lock the devvp's interlock just to release it on the next line by passing LK_INTERLOCK to lockmgr. - Restructure ffs_snapshot_unmount so we don't call free() with the devvp's interlock locked.
|
144586 |
03-Apr-2005 |
jeff |
- In ffs_sync we need to pass LK_SLEEPFAIL in when we lock the vnode because it may change identities while we're sleeping on the lock. Otherwise we may bail out of ffs_sync() early due to an error from deadfs. - Collapse a VOP_UNLOCK, vrele into a single vput().
|
144585 |
03-Apr-2005 |
jeff |
- Move the contents of softdep_disk_prewrite into ffs_geom_strategy to fix two bugs. - ffs_disk_prewrite was pulling the vp from the buf and checking for COPYONWRITE, when really it wanted the vp from the bufobj that we're writing to, which is the devvp. This led to us skipping the copy on write to all file data, which significantly broke snapshots for the last few months. - When the SOFTUPDATES option was not included in the kernel config we would also skip the copy on write check, which would effectively disable snapshots. - Remove an invalid mp_fixme().
Debugging tips from: mckusick Reported by: iedowse, others Discussed with: phk
|
144375 |
31-Mar-2005 |
jeff |
- FFS supports shared locks, clear LK_NOSHARE from our vnode locks.
Sponsored by: Isilon Systems, Inc.
|
144373 |
31-Mar-2005 |
jeff |
- Set LK_NOSHARE for snapshot locks. snapshots require exclusive only access. - Remove the hack from ffs_lock() to implement LK_NOSHARE in a ffs specific way.
Sponsored by: Isilon Systems, Inc.
|
144367 |
31-Mar-2005 |
jeff |
- LK_NOPAUSE is a nop now.
Sponsored by: Isilon Systems, Inc.
|
144289 |
29-Mar-2005 |
jeff |
- Upgrade a shared lock request to exclusive in ffs_vget() if we have to create the vnode.
Sponsored by: Isilon Systems, Inc.
|
144118 |
25-Mar-2005 |
das |
When the softupdates worklist gets too long, threads that attempt to add more work are forced to process two worklist items first. However, processing an item may generate additional work, causing the unlucky thread to recursively process the worklist. Add a per-thread flag to detect this situation and avoid the recursion. This should fix the stack overflows that could occur while removing large directory trees.
Tested by: kris Reviewed by: mckusick
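A minimal userland sketch of the recursion guard described above, assuming a simulated worklist held in a counter; it is not the softupdates code. The threshold MAX_PENDING, the followups budget and all function names are hypothetical.

```c
#include <assert.h>

#define MAX_PENDING 4            /* hypothetical "worklist too long" threshold */

static int pending;              /* items on the simulated worklist */
static int followups = 100;      /* budget of extra work that processing may generate */
static int depth, max_depth;     /* nesting of process_one(), for the demo only */
static _Thread_local int in_process;   /* per-thread "already processing" flag */

void add_work(void);

static void process_one(void)
{
    if (pending == 0)
        return;
    pending--;
    if (++depth > max_depth)
        max_depth = depth;
    if (followups > 0) {         /* processing an item may queue more work */
        followups--;
        add_work();
    }
    depth--;
}

void add_work(void)
{
    pending++;
    assert(pending > 0);
    if (pending <= MAX_PENDING || in_process)
        return;                  /* short list, or already processing: do not recurse */
    in_process = 1;
    process_one();               /* the adding thread processes two items first */
    process_one();
    in_process = 0;
}

int recursion_demo(void)
{
    for (int i = 0; i < 50; i++)
        add_work();
    return max_depth;
}
```

With the flag in place, the nested add_work() call made from process_one() returns immediately, so the observed nesting depth stays at one regardless of how much follow-up work is generated.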
|
143692 |
16-Mar-2005 |
phk |
Add two arguments to the vfs_hash() KPI so that filesystems which do not have unique hashes (NFS) can also use it.
|
143666 |
15-Mar-2005 |
phk |
Don't hold a reference on the disk vnode for each inode.
|
143663 |
15-Mar-2005 |
phk |
Improve the vfs_hash() API: vput() the unneeded vnode centrally to avoid replicating the vput in all the filesystems.
|
143619 |
15-Mar-2005 |
phk |
Simplify the vfs_hash calling convention.
|
143562 |
14-Mar-2005 |
phk |
Use vfs_hash instead of home-rolled.
|
143504 |
13-Mar-2005 |
jeff |
- It is not legal to access v_data without the vnode lock or interlock held. Grab the vnode interlock if LK_INTERLOCK has not been passed in so that we can inspect v_data in ffs_lock().
Sponsored by: Isilon Systems, Inc.
|
143503 |
13-Mar-2005 |
jeff |
- The VI_DOOMED flag now signals the end of a vnode's relationship with the filesystem. Check that rather than VI_XLOCK. - Shorten ffs_reload by one step. The old check for an inactive vnode was slightly racey, and the code which deals with still active vnodes is not much more expensive.
Sponsored by: Isilon Systems, Inc.
|
143502 |
13-Mar-2005 |
jeff |
- The VI_DOOMED flag now signals the end of a vnode's relationship with the filesystem. Check that rather than VI_XLOCK.
Sponsored by: Isilon Systems, Inc.
|
143501 |
13-Mar-2005 |
jeff |
- Fix an assert now that the XLOCK no longer exists.
Sponsored by: Isilon Systems, Inc.
|
142879 |
01-Mar-2005 |
jeff |
- Fix another dyslexic moment; an atomic_set_int should've become ACTIVESET, not ACTIVECLEAR.
Submitted by: iedowse
|
142263 |
22-Feb-2005 |
jeff |
- Add VOP locking asserts in several functions that have been implicated in recent deadlocks.
|
142123 |
20-Feb-2005 |
delphij |
The recomputation of file system summary at mount time can be a very slow process, especially for large file systems that have just been recovered from a crash.
Since the summary is already re-sync'ed every 30 second, we will not lag behind too much after a crash. With this consideration in mind, it is more reasonable to transfer the responsibility to background fsck, to reduce the delay after a crash.
Add a new sysctl variable, vfs.ffs.compute_summary_at_mount, to control this behavior. When set to nonzero, we will get the "old" behavior, that the summary is computed immediately at mount time.
Add five new sysctl variables to adjust ndir, nbfree, nifree, nffree and numclusters respectively. Teach fsck_ffs about these APIs; however, it intentionally does not check for their existence, since kernels without these sysctls must have recomputed the summary and hence no adjustments are necessary.
This change has eliminated the usual tens of minutes of delay of mounting large dirty volumes.
Reviewed by: mckusick MFC After: 1 week
|
142079 |
19-Feb-2005 |
phk |
Try to unbreak the vnode locking around vop_reclaim() (based mostly on patch from kan@).
Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on close. This is not yet a generally safe function, but for this very specific use it is safe. This solves the problem with buffers not being flushed by unmount or after failed mount attempts.
|
142074 |
19-Feb-2005 |
delphij |
When clearing a fragment, it's possible that the length is zero.
Reviewed by: mckusick MFC After: 1 week
|
141685 |
11-Feb-2005 |
phk |
Make non-SOFTUPDATES kernels compile again.
Integrate the stubfile into the main file now that license issues have been long resolved.
|
141631 |
10-Feb-2005 |
phk |
Make some SYSCTL_NODEs and some of FFS's VFS_ methods static.
|
141595 |
09-Feb-2005 |
jeff |
- In the softupdates case for ffs_truncate() we use vinvalbuf() to invalidate pending io and dependencies. However, vinvalbuf() rightfully does not call vnode_pager_setsize() for us. We must do this here. This could potentially have caused numerous kinds of bugs, but it was specifically causing msync() deadlocks because msync() was flushing pages that should not have been valid.
Sponsored by: Isilon Systems, Inc. Reported by: kkenn
|
141570 |
09-Feb-2005 |
phk |
style polishing.
|
141542 |
08-Feb-2005 |
phk |
Split the vop_vector for ffs1 and ffs2, this is mostly for the different EXTATTR support.
|
141541 |
08-Feb-2005 |
phk |
Use ffs_truncate() directly instead of UFS_TRUNCATE()
|
141539 |
08-Feb-2005 |
phk |
Background writes are entirely an FFS/Softupdates thing.
Give FFS vnodes a specific bufwrite method which contains all the background write stuff and then calls into the default bufwrite() for the rest of the job.
Remove all the background write related stuff from the normal bufwrite.
This drags the softdep_move_dependencies() back into FFS.
Long term, it is worth looking at simply copying the data into allocated memory and issuing the bio directly and not create the "shadow buf" in the first place (just like copy-on-write is done in snapshots for instance). I don't think we really gain anything but complexity from doing this with a buf.
|
141533 |
08-Feb-2005 |
phk |
Drag another softupdates tentacle back into FFS: Now that FFS's vop_fsync is separate from the internal use we can do the full job there.
|
141526 |
08-Feb-2005 |
phk |
Don't use the UFS_* and VFS_* functions where a direct call is possible.
The UFS_ functions are for UFS to call back into VFS. The VFS functions are external entry points into the filesystem.
|
141522 |
08-Feb-2005 |
phk |
For snapshots we need all VOP_LOCKs to be exclusive.
The "business class upgrade" was implemented in UFS's VOP_LOCK implementation ufs_lock() which is the wrong layer, so move it to ffs_lock().
Also, as long as we have not abandoned advanced vfs-stacking we should not preclude it from happening: instead of implementing a copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at vop_stdlock() at the bottom.
|
141521 |
08-Feb-2005 |
phk |
For snapshots we need all VOP_LOCKs to be exclusive.
The "business class upgrade" was implemented in UFS's VOP_LOCK implementation ufs_lock() which is the wrong layer, so move it to ffs_lock().
Also, as long as we have not abandoned advanced vfs-stacking we should not preclude it from happening: instead of implementing a copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at vop_stdlock() at the bottom.
|
141520 |
08-Feb-2005 |
phk |
Use VOP_STRATEGY_APV() instead of direct dereference, this is more correct.
|
141150 |
02-Feb-2005 |
jeff |
- Use a separate malloc tag for saved inode contents to help in debugging memory modified after free errors.
Sponsored by: Isilon Systems, Inc.
|
140822 |
25-Jan-2005 |
phk |
Introduce and use g_vfs_close().
|
140782 |
25-Jan-2005 |
phk |
Don't use VOP_GETVOBJECT, use vp->v_object directly.
|
140774 |
24-Jan-2005 |
phk |
Don't create vnode_pager objects for the disk device. geom_vfs will do that.
|
140709 |
24-Jan-2005 |
jeff |
- Convert the global LK lock to a mutex. - Expand the scope of lk to cover not only interrupt races, but also top-half races, which includes many new uses over global top-half only data. - Get rid of interlocked_sleep() and use msleep or BUF_LOCK where appropriate. - Use the lk mutex in place of the various hand rolled semaphores. - Stop dropping the lk lock before we panic. - Fix getdirtybuf() callers so that they reacquire access to whatever softdep datastructure they were inspecting in the failure/retry case. Previously, sleeps in getdirtybuf() could leave us with pointers to bad memory. - Update handling of ffs to be compatible with ffs locking changes.
Sponsored By: Isilon Systems, Inc.
|
140708 |
24-Jan-2005 |
jeff |
- Initialize and destroy the per-filesystem ufs lock where appropriate. - Use the buffer lock on the superblock buf to serialize calls to sbupdate. - Set the MNTK_MPSAFE flag when QUOTA is not defined in the kernel.
Sponsored By: Isilon Systems, Inc.
|
140707 |
24-Jan-2005 |
jeff |
- Remove GIANT_REQUIRED where giant is no longer required.
Sponsored By: Isilon Systems, Inc.
|
140706 |
24-Jan-2005 |
jeff |
- Use the ufs lock to protect fs_active.
Sponsored By: Isilon Systems, Inc.
|
140705 |
24-Jan-2005 |
jeff |
- Acquire the ufs lock around several ffs_alloc functions that require it.
Sponsored By: Isilon Systems, Inc.
|
140704 |
24-Jan-2005 |
jeff |
- Don't use atomic operations to deal with the active array, instead it is now quite naturally protected by the ufsmount mutex. - Use the ufs lock to protect various fields in struct fs, primarily the cg summary needs protection to avoid allocation races. Several functions have been slightly re-arranged to reduce the number of lock operations. - Adjust several functions (blkfree, freefile, etc.) to accept a ufsmount as an argument so that we may access the ufs lock.
Sponsored By: Isilon Systems, Inc.
|
140703 |
24-Jan-2005 |
jeff |
- Acquire the ufs lock when manipulating some fields of struct fs. - Change arguments to various ffs functions to match their new prototypes.
Sponsored By: Isilon Systems, Inc.
|
140702 |
24-Jan-2005 |
jeff |
- Mark the struct fs members that require the ufsmount mutex. - Define some macros for manipulating the fs_active bitmap.
Sponsored By: Isilon Systems, Inc.
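A minimal sketch of set/clear/test macros over a word-array bitmap, in the spirit of the fs_active helpers mentioned above. The BM_* names are hypothetical and are not the definitions added by this commit; in the kernel the bitmap would be manipulated with the ufsmount mutex held, which this userland example omits.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_WORD   (CHAR_BIT * sizeof(unsigned int))
#define BM_WORD(map, i) ((map)[(i) / BITS_PER_WORD])       /* word holding bit i */
#define BM_MASK(i)      (1u << ((i) % BITS_PER_WORD))      /* bit i within its word */
#define BM_SET(map, i)   (BM_WORD(map, i) |= BM_MASK(i))
#define BM_CLEAR(map, i) (BM_WORD(map, i) &= ~BM_MASK(i))
#define BM_ISSET(map, i) ((BM_WORD(map, i) & BM_MASK(i)) != 0)

int bitmap_demo(void)
{
    unsigned int active[4] = { 0 };   /* room for 4 * BITS_PER_WORD entries */

    BM_SET(active, 3);
    BM_SET(active, 40);
    BM_CLEAR(active, 3);
    assert(!BM_ISSET(active, 3));     /* cleared again */
    return BM_ISSET(active, 40);      /* still set */
}
```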
|
140701 |
24-Jan-2005 |
jeff |
- Change some function parameters so that the ufsmount structure is accessible in places where the ufs lock will be needed.
Sponsored By: Isilon Systems, Inc.
|
140306 |
15-Jan-2005 |
pjd |
Fix ACLs handling for the root file system. Without this fix, when ACLs are set via tunefs(8) on the root file system, they are removed on boot when 'mount -a' is called, because mount(8) called for the root file system always adds the MNT_UPDATE flag and the MNT_UPDATE flag isn't perfect. Now, one cannot remove ACLs stored in superblock (configured with tunefs(8)) via 'mount -a' nor 'mount -u -o noacls <file system>', but it is still possible to mount a file system which doesn't have ACLs in superblock via 'mount -o acls <file system>' or /etc/fstab's 'acls' option.
Reported by: Lech Lorens/pl.comp.os.bsd Discussed with: phk, rwatson Reviewed by: rwatson MFC after: 2 weeks
|
140220 |
14-Jan-2005 |
phk |
Eliminate unused and unnecessary "cred" argument from vinvalbuf()
|
140181 |
13-Jan-2005 |
phk |
Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT() directly.
|
140056 |
11-Jan-2005 |
phk |
Add BO_SYNC() and add a default which uses the secret vnode pointer and VOP_FSYNC() for now.
|
140051 |
11-Jan-2005 |
phk |
Wrap the bufobj operations in macros: BO_STRATEGY() and BO_WRITE()
|
140048 |
11-Jan-2005 |
phk |
Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().
I'm not sure why a credential was added to these in the first place, it is not used anywhere and it doesn't make much sense:
The credentials for syncing a file (ability to write to the file) should be checked at the system call level.
Credentials for syncing one or more filesystems ("none") should be checked at the system call level as well.
If the filesystem implementation needs a particular credential to carry out the syncing it would logically have to be the cached mount credential, or a credential cached along with any delayed write data.
Discussed with: rwatson
|
139825 |
07-Jan-2005 |
imp |
/* -> /*- for license, minor formatting changes
|
138869 |
14-Dec-2004 |
phk |
white space
|
138744 |
12-Dec-2004 |
phk |
With the introduction of UFS2 we started looking for superblocks in four different locations on a prospective filesystem.
If we found none, we forgot to invalidate the four buffers, thus the following sequence would fail:
(md0 = blank disk)
mount /dev/md0 /mnt (fails, no superblocks)
newfs /dev/md0 (writes using physio, which does not go through the buffer cache)
mount /dev/md0 /mnt (still fails, the four cached buffers still contain no superblocks)
Found by: ru
|
138634 |
09-Dec-2004 |
mckusick |
Fixes a bug that caused UFS2 filesystems bigger than 2TB to prematurely report that they were full and/or to panic the kernel with the message ``ffs_clusteralloc: allocated out of group''.
Submitted by: Henry Whincup <henry@jot.to> MFC after: 1 week
|
138557 |
08-Dec-2004 |
phk |
Fix snapshot creation.
|
138517 |
07-Dec-2004 |
phk |
Fix nfs exports (for now). The real fix is to teach mountd about nmount.
|
138509 |
07-Dec-2004 |
phk |
The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly split the conversion of the remaining three filesystems out from the root mounting changes, so in one go:
cd9660: Convert to nmount. Add omount compat shims. Remove dedicated rootfs mounting code. Use vfs_mountedfrom(). Rely on vfs_mount.c calling VFS_STATFS().
nfs(client): Convert to nmount (the simple way, mount_nfs(8) is still necessary). Add omount compat shims. Drop COMPAT_PRELITE2 mount arg compatibility.
ffs: Convert to nmount. Add omount compat shims. Remove dedicated rootfs mounting code. Use vfs_mountedfrom(). Rely on vfs_mount.c calling VFS_STATFS().
Remove vfs_omount() method, all filesystems are now converted.
Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem task, and they all do it now.
Change rootmounting to use DEVFS trampoline:
vfs_mount.c: Mount devfs on /. Devfs needs no 'from' so this is clean. symlink /dev to /. This makes it possible to lookup /dev/foo. Mount "real" root filesystem on /. Surgically move the devfs mountpoint from under the real root filesystem onto /dev in the real root filesystem.
Remove now unnecessary getdiskbyname().
kern_init.c: Don't do devfs mounting and rootvnode assignment here, it was already handled by vfs_mount.c.
Remove now unused bdevvp(), addaliasu() and addalias(). Put the few necessary lines in devfs where they belong. This eliminates the second-last source of bogo vnodes, leaving only the lemming-syncer.
Remove rootdev variable, it has no meaning in a global context and was not trustworthy anyway. Correct information is provided by statfs(/).
|
138412 |
05-Dec-2004 |
phk |
VFS_STATFS(mp, ...) is mostly called with &mp->mnt_stat, but in a few cases it isn't. Most of the implementations have grown weeds for this so they copy some fields from mnt_stat if the passed argument isn't that.
Fix this the cleaner way: Always call the implementation on mnt_stat and copy that in toto to the VFS_STATFS argument if different.
|
138359 |
03-Dec-2004 |
phk |
typo in comment.
|
138290 |
01-Dec-2004 |
phk |
Back when VOP_* was introduced, we did not have new-style struct initializations but we did have lofty goals and big ideals.
Adjust to more contemporary circumstances and gain type checking.
Replace the entire vop_t frobbing thing with properly typed structures. The only casualty is that we can not add a new VOP_ method with a loadable module. History has not given us reason to believe this would ever be feasible in the first place.
Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.
Give coda correct prototypes and function definitions for all vop_()s.
Generate a bit more data from the vnode_if.src file: a struct vop_vector and prototype typedefs for all vop methods.
Add a new vop_bypass() and make vop_default be a pointer to another struct vop_vector.
Remove a lot of vfs_init since vop_vector is ready to use from the compiler.
Cast various vop_mumble() to void * with uppercase name, for instance VOP_PANIC, VOP_NULL etc.
Implement VCALL() by making vdesc_offset the offsetof() the relevant function pointer in vop_vector. This is disgusting but since the code is generated by a script comparatively safe. The alternative for nullfs etc. would be much worse.
Fix up all vnode method vectors to remove casts so they become typesafe. (The bulk of this is generated by scripts)
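The offsetof()-based dispatch mentioned for VCALL() can be illustrated with a small userland sketch. Every name here (the demo_ prefix, the two-entry vector, the int argument) is hypothetical and is not the kernel's vop_vector or VCALL(); the sketch only shows how a function pointer can be located by its byte offset inside a typed structure while the vector itself stays type-checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical operation type; real vnode operations take argument structs. */
typedef int demo_vop_fn(int);

/* Typed method vector: the compiler checks every initializer. */
struct demo_vop_vector {
	demo_vop_fn *vop_open;
	demo_vop_fn *vop_close;
};

static int demo_open(int arg) { return (arg + 1); }
static int demo_close(int arg) { return (arg * 2); }

static const struct demo_vop_vector demo_vec = {
	.vop_open = demo_open,
	.vop_close = demo_close,
};

/* Call the method stored 'off' bytes into the vector. */
static int
demo_vcall(const struct demo_vop_vector *v, size_t off, int arg)
{
	demo_vop_fn *fn;

	assert(off + sizeof(fn) <= sizeof(*v));
	/* Copy the pointer out by offset instead of naming the member. */
	memcpy(&fn, (const char *)v + off, sizeof(fn));
	assert(fn != NULL);
	return (fn(arg));
}
```

A caller that only knows an operation's offset, for example offsetof(struct demo_vop_vector, vop_close), can pass it to demo_vcall() without knowing the member name.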
|
138270 |
01-Dec-2004 |
phk |
Mechanically change prototypes for vnode operations to use the new typedefs.
|
138075 |
25-Nov-2004 |
phk |
Use system wide no-op vfs_start function.
|
137846 |
18-Nov-2004 |
jeff |
- Eliminate the acquisition and release of the bqlock in bremfree() by setting the B_REMFREE flag in the buf. This is done to prevent lock order reversals with code that must call bremfree() with a local lock held. This also reduces overhead by removing two lock operations per buf for fsync() and similar.
- Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock has been acquired so that we may remove ourself from the free-list.
- Provide a bremfreef() function to immediately remove a buf from a free-list for use only by NFS. This is done because the nfsclient code overloads the b_freelist queue for its own async. io queue.
- Simplify the numfreebuffers accounting by removing a switch statement that executed the same code in every possible case.
- getnewbuf() can encounter locked bufs on free-lists once Giant is removed. Remove a panic associated with this condition and delay asserts that inspect the buf until after it is locked.
Reviewed by: phk Sponsored by: Isilon Systems, Inc.
|
137657 |
13-Nov-2004 |
phk |
Be prepared to accept NULL mountargs as part of root-mounting.
|
137608 |
12-Nov-2004 |
phk |
Put back the vfs_object_create() calls, they do make a difference when my test-setup does what I want it to instead of what I ask it to.
Pointed out by: tegge
|
137504 |
10-Nov-2004 |
phk |
fix some comments
|
137491 |
09-Nov-2004 |
phk |
Use mount flags instead of NULL path to detect root filesystem mount.
|
137486 |
09-Nov-2004 |
phk |
Stop pretending to have a vm_object backing the underlying disk vnode: it isn't used for anything anywhere and the vnode_pager would explode if we attempted to.
|
137194 |
04-Nov-2004 |
phk |
Don't grab the exclusive bit on a root filesystem until we are willing to mount it. Doing so prevented fsck from being run after a refused mount.
|
137035 |
29-Oct-2004 |
phk |
Move UFS from DEVFS backing to GEOM backing.
This eliminates a bunch of vnode overhead (approx 1-2 % speed improvement) and gives us more control over the access to the storage device.
Access counts on the underlying device are not correctly tracked and therefore it is possible to read-only mount the same disk device multiple times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1
Since UFS/FFS is not a synchronously consistent filesystem (ie: it caches things in RAM) this is not possible with read-write mounts, and the system will correctly reject this.
Details:
Add a geom consumer and a bufobj pointer to ufsmount.
Eliminate the vnode argument from softdep_disk_prewrite(). Pick the vnode out of bp->b_vp for now. Eventually we should find it through bp->b_bufobj->b_private.
In the mountcode, use g_vfs_open() once we have used VOP_ACCESS() to check permissions.
When upgrading and downgrading between r/o and r/w do the right thing with GEOM access counts. Remove all the workarounds for not being able to do this with VOP_OPEN().
If we are the root mount, drop the exclusive access count until we upgrade to r/w. This allows fsck of the root filesystem and the MNT_RELOAD to work correctly.
Set bo_private to the GEOM consumer on the device bufobj.
Change the ffs_ops->strategy function to call g_vfs_strategy()
In ufs_strategy() directly call the strategy on the disk bufobj. Same in rawread.
In ffs_fsync() we will no longer see VCHR device nodes, so remove code which synced the filesystem mounted on it, in case we came there. I'm not sure this code made sense in the first place since we would have taken the specfs route on such a vnode.
Redo the highly bogus readblock() function in the snapshot code to something slightly less bogus: Constructing an uio and using physio was really quite a detour. Instead just fill in a bio and ship it down.
|
137007 |
28-Oct-2004 |
phk |
We only support backing UFS/FFS with disks.
|
136988 |
27-Oct-2004 |
phk |
Eliminate unnecessary KASSERTS.
|
136982 |
26-Oct-2004 |
phk |
KASSERT that we only get to prewrite() on writes.
|
136981 |
26-Oct-2004 |
phk |
White space changes. Add missing static.
|
136969 |
26-Oct-2004 |
phk |
The island council met and voted buf_prewrite() home.
Give ffs its own bufobj->bo_ops vector and create a private strategy routine, (currently misnamed for forwards compatibility), which is just a copy of the generic bufstrategy routine except we call softdep_disk_prewrite() directly instead of through the buf_prewrite() indirection.
Teach UFS about the need for softdep_disk_prewrite() and call the function directly in FFS.
Remove buf_prewrite() from the default bufstrategy() and from the global bio_ops method vector.
|
136968 |
26-Oct-2004 |
phk |
Fix syntax errors introduced by last commit.
Why isn't DIRECTIO in NOTES/LINT ?
|
136966 |
26-Oct-2004 |
phk |
Put the I/O block size in bufobj->bo_bsize.
We keep si_bsize_phys around for now as that is the simplest way to pull the number out of disk device drivers in devfs_open(). The correct solution would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot when filesystems sit on GEOM, so don't bother for now.
|
136963 |
26-Oct-2004 |
phk |
Degeneralize the per cdev copyonwrite callback. The only possible value is ffs_copyonwrite() and the only place it can be called from is FFS which would never want to call another filesystem's copyonwrite method, should one exist, so there is no reason why anything generic should know about this.
|
136943 |
25-Oct-2004 |
phk |
Lose the v_dirty* and v_clean* alias macros.
Check the count field where we just want to know the full/empty state, rather than using TAILQ_EMPTY() or TAILQ_FIRST().
|
136941 |
25-Oct-2004 |
phk |
Remove vnode->v_bsize. This was a dead-end.
|
136927 |
24-Oct-2004 |
phk |
Move the buffer method vector (buf->b_op) to the bufobj.
Extend it with a strategy method.
Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY song and dance.
Rename ibwrite to bufwrite().
Move the two NFS buf_ops to more sensible places, add bufstrategy to them.
Add inlines for bwrite() and bstrategy() which call through buf->b_bufobj->b_ops->b_{write,strategy}().
Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
|
136767 |
22-Oct-2004 |
phk |
Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.
Initialize b_bufobj for all buffers.
Make incore() and gbincore() take a bufobj instead of a vnode.
Make inmem() local to vfs_bio.c
Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj) also VI_MTX() to BO_MTX(),
Make buf_vlist_add() take a bufobj instead of a vnode.
Eliminate other uses of bp->b_vp where bp->b_bufobj will do.
Various minor polishing: remove "register", turn panic into KASSERT, use new function declarations, TAILQ_FOREACH_SAFE() etc.
|
136751 |
21-Oct-2004 |
phk |
Move the VI_BWAIT flag into a bo_flag element of bufobj and call it BO_WWAIT.
Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write count on a bufobj. Bufobj_wdrop() replaces vwakeup().
Use these functions in all relevant places except in ffs_softdep.c where the use of interlocked_sleep() makes this impossible.
Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
|
136721 |
20-Oct-2004 |
rwatson |
Explicitly break out NETA license from Berkeley license to clearly indicate license grant, as well as to indicate that NETA is asserting only two clauses, not four clauses.
Requested by: imp
|
136336 |
09-Oct-2004 |
njl |
Fix fsbtodb() for UFS1. This fixes an overflow for file sizes >1 TB, allowing for sizes up to 4 TB. This doesn't affect UFS2 since b is already a 64 bit type, coincidental with daddr_t.
Submitted by: bde
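The overflow described above comes from shifting a 32-bit block number before it is widened. The sketch below is a userland illustration only: the demo_ function names are hypothetical, unsigned 32-bit arithmetic is used so the wraparound is well defined in C, and it is not the kernel's fsbtodb() macro.

```c
#include <assert.h>
#include <stdint.h>

/* Shift performed in 32 bits: high bits are lost before widening. */
static int64_t
demo_fsbtodb_narrow(uint32_t b, unsigned shift)
{
	assert(shift < 32);
	return ((int64_t)(uint32_t)(b << shift));
}

/* Widen to 64 bits first, then shift: no bits are lost. */
static int64_t
demo_fsbtodb_wide(uint32_t b, unsigned shift)
{
	assert(shift < 32);
	return ((int64_t)b << shift);
}
```

With a block number of 0x40000000 and a shift of 2, the narrow version wraps to 0 while the widened version yields 4294967296.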
|
136144 |
05-Oct-2004 |
pjd |
Back out changes which were introduced to delay mounting the root file system. Those changes were made for gmirror's needs, but now gmirror handles this by itself.
|
135877 |
28-Sep-2004 |
phk |
Remove support for accessing device nodes in UFS/FFS.
Device nodes can still be created and exported with NFS.
|
135858 |
27-Sep-2004 |
phk |
Give cluster_write() an explicit vnode argument.
In the future a struct buf will not automatically point out a vnode for us.
|
135612 |
23-Sep-2004 |
pjd |
Introduce a new /boot/loader.conf variable: root_mount_delay. It can be used to delay mounting the root partition to give GEOM providers a chance to show up. Now, when the needed provider is not there, the vfs_rootmount() function will look for it every second and if it can't be found in the defined time, it'll ask for the root device name (before this change it was done immediately).
This will allow booting from a gmirror device in degraded mode.
|
135459 |
19-Sep-2004 |
phk |
The getpages VOP was a good stab at getting scatter/gather I/O without too much kernel copying, but it is not the right way to do it, and it is in the way for straightening out the buffer cache.
The right way is to pass the VM page array down through the struct bio to the disk device driver and DMA directly in to/out of the physical memory. Once the VM/buf thing is sorted out it is next on the list.
Retire most of the vnode method ffs_getpages(). It is not clear if what is left shouldn't be in the default implementation which we now fall back to.
Retire specfs_getpages() as well, as it has no users now.
|
135312 |
16-Sep-2004 |
phk |
Do not traverse list of snapshots if there isn't one.
Found by: scottl
|
135303 |
16-Sep-2004 |
phk |
Missed a place where snapshots were allocated in my last commit to this file.
|
135138 |
13-Sep-2004 |
phk |
Create struct snapdata which contains the snapshot fields from cdev and the previously malloc'ed snapshot lock.
Malloc struct snapdata instead of just the lock.
Replace snapshot fields in cdev with pointer to snapdata (saves 16 bytes).
While here, give the private readblock() function a vnode argument in preparation for moving UFS to access GEOM directly.
|
135135 |
13-Sep-2004 |
phk |
Remove the buffercache/vnode side of BIO_DELETE processing in preparation for integration of p4::phk_bufwork. In the future, local filesystems will talk to GEOM directly and they will consequently be able to issue BIO_DELETE directly. Since the removal of the fla driver, BIO_DELETE has effectively been a no-op anyway.
|
134011 |
19-Aug-2004 |
jhb |
Generalize the UFS bad magic value used to determine when a filesystem has only been partly initialized via newfs(8) so that it applies to both UFS1 and UFS2.
Submitted by: "Xin LI" delphij at frontfree dot net MFC: maybe?
|
133741 |
15-Aug-2004 |
jmg |
Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowledge of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using its filter ops.
Currently, it uses the MTX_DUPOK even though it is not always safe to acquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases).
Reviewed by: green, rwatson (both earlier versions)
|
133327 |
08-Aug-2004 |
phk |
use bufdone() not biodone().
|
132902 |
30-Jul-2004 |
phk |
Put a version element in the VFS filesystem configuration structure and refuse initializing filesystems with a wrong version. This will aid maintenance activities on the 5-stable branch.
s/vfs_mount/vfs_omount/
s/vfs_nmount/vfs_mount/
Name our filesystems' mount function consistently.
Eliminate the nameidata argument to both vfs_mount and vfs_omount. It was originally there to save stack space. A few places abused it to get hold of some credentials to pass around. Effectively it is unused.
Reorganize the root filesystem selection code.
|
132805 |
28-Jul-2004 |
phk |
Remove global variable rootdevs and rootvp, they are unused as such.
Add local rootvp variables as needed.
Remove checks for miniroot's in the swappartition. We never did that and most of the filesystems could never be used for that, but it had still been copy&pasted all over the place.
|
132775 |
28-Jul-2004 |
kan |
Avoid using casts as lvalues. Introduce DIP_SET macro which sets proper inode field based on UFS version. Use DIP to read values and DIP_SET to modify them throughout FFS code base.
|
132653 |
26-Jul-2004 |
cperciva |
Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I hear any complaints...)
Discussed with: rwatson, scottl Requested by: jhb
|
132154 |
14-Jul-2004 |
phk |
Make sure to update the mnt_stats before UFS1 extattr tries to do I/O on the device. Otherwise the blocksize is undefined in the buffer cache.
|
132023 |
12-Jul-2004 |
alfred |
Make VFS_ROOT() and vflush() take a thread argument. This is to allow filesystems to decide based on the passed thread which vnode to return. Several filesystems used curthread, they now use the passed thread.
|
131907 |
10-Jul-2004 |
marcel |
Update for the KDB debugger framework:
o Make debugging code conditional upon KDB.
o Use kdb_backtrace() instead of backtrace().
o Remove inclusion of opt_ddb.h.
|
131756 |
07-Jul-2004 |
phk |
Explicitly initialize vp->v_bsize.
|
131551 |
04-Jul-2004 |
phk |
When we traverse the vnodes on a mountpoint we need to look out for our cached 'next vnode' being removed from this mountpoint. If we find that it was recycled, we restart our traversal from the start of the list.
Code to do that is in all local disk filesystems (and a few other places) and looks roughly like this:
	MNT_ILOCK(mp);
loop:
	for (vp = TAILQ_FIRST(&mp...); (vp = nvp) != NULL; nvp = TAILQ_NEXT(vp,...)) {
		if (vp->v_mount != mp)
			goto loop;
		MNT_IUNLOCK(mp);
		...
		MNT_ILOCK(mp);
	}
	MNT_IUNLOCK(mp);
The code which takes vnodes off a mountpoint looks like this:
	MNT_ILOCK(vp->v_mount);
	...
	TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
	...
	MNT_IUNLOCK(vp->v_mount);
	...
	vp->v_mount = something;
(Take a moment and try to spot the locking error before you read on.)
On an SMP system, one CPU could have removed nvp from our mountlist but not yet gotten to assign a new value to vp->v_mount while another CPU simultaneously gets to the top of the traversal loop, where it finds that (vp->v_mount != mp) is not true despite the fact that the vnode has indeed been removed from our mountpoint.
Fix:
Introduce the macro MNT_VNODE_FOREACH() to traverse the list of vnodes on a mountpoint while taking into account that vnodes may be removed from the list as we go. This saves approx 65 lines of duplicated code.
Split the insmntque() which potentially moves a vnode from one mount point to another into delmntque() and insmntque() which do just what the names say.
Fix delmntque() to set vp->v_mount to NULL while holding the mountpoint lock.
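The invariant behind the delmntque() fix can be modelled in a few lines of userland C. This is a single-threaded sketch under stated assumptions: the demo_ structures, the singly linked list and the integer mnt_locked flag are hypothetical stand-ins for the kernel's TAILQ and mount interlock, so it shows the ordering of the steps rather than real concurrency.

```c
#include <assert.h>
#include <stddef.h>

struct demo_mount;

struct demo_vnode {
	struct demo_vnode *v_next;
	struct demo_mount *v_mount;
};

struct demo_mount {
	int mnt_locked;			/* stand-in for the mount interlock */
	struct demo_vnode *mnt_head;
};

/* Put a vnode on a mount's list; list and v_mount change together. */
static void
demo_insmntque(struct demo_vnode *vp, struct demo_mount *mp)
{
	assert(vp->v_mount == NULL);
	mp->mnt_locked = 1;
	vp->v_next = mp->mnt_head;
	mp->mnt_head = vp;
	vp->v_mount = mp;
	mp->mnt_locked = 0;
}

/* Take a vnode off its mount's list. */
static void
demo_delmntque(struct demo_vnode *vp)
{
	struct demo_mount *mp = vp->v_mount;
	struct demo_vnode **pp;

	assert(mp != NULL);
	mp->mnt_locked = 1;
	for (pp = &mp->mnt_head; *pp != NULL; pp = &(*pp)->v_next) {
		if (*pp == vp) {
			*pp = vp->v_next;
			break;
		}
	}
	vp->v_next = NULL;
	/*
	 * Clear v_mount before the lock is dropped, so a walker never
	 * sees a removed vnode that still claims to be on this mount.
	 */
	vp->v_mount = NULL;
	mp->mnt_locked = 0;
}

/* Walk the list under the lock; every listed vnode must point back at mp. */
static int
demo_count(struct demo_mount *mp)
{
	struct demo_vnode *vp;
	int n = 0;

	mp->mnt_locked = 1;
	for (vp = mp->mnt_head; vp != NULL; vp = vp->v_next) {
		assert(vp->v_mount == mp);
		n++;
	}
	mp->mnt_locked = 0;
	return (n);
}
```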
|
130690 |
18-Jun-2004 |
kuriyama |
Avoid deadlock which is caused by locking VDIR of parent and VREG of snapshot itself in wrong order. We can skip unlink check of that directory because it must have snapshot in it.
Reviewed by: mckusick and current@
|
130585 |
16-Jun-2004 |
phk |
Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.
|
130551 |
16-Jun-2004 |
julian |
Nice is a property of a process as a whole. I mistakenly moved it to the ksegroup when breaking up the process structure. Put it back in the proc structure.
|
130246 |
08-Jun-2004 |
stefanf |
Avoid assignments to cast expressions.
Reviewed by: md5 Approved by: das (mentor)
|
130023 |
03-Jun-2004 |
tjr |
Move TDF_DEADLKTREAT into td_pflags (and rename it accordingly) to avoid having to acquire sched_lock when manipulating it in lockmgr(), uiomove(), and uiomove_fromphys().
Reviewed by: jhb
|
129895 |
31-May-2004 |
krion |
- Fix typo
Approved by: tobez
|
129545 |
21-May-2004 |
kensmith |
Upon further review it was decided this piece of the msync(2) fixes was applicable to HEAD, originally it was thought this should only be done in RELENG_4. Implement IO_INVAL in the vnode op for writing by marking the buffer as "no cache". This fix has already been applied to RELENG_4 as Rev. 1.65.2.15 of ufs/ufs/ufs_readwrite.c.
Reviewed by: alc, tegge
|
129450 |
19-May-2004 |
kensmith |
Style fixup in previous commit.
Noticed by: bde (thanks!)
|
129244 |
14-May-2004 |
kensmith |
Change ffs_realloccg() to set the valid bits for the extended part of the fragment to zero the valid parts of a VM_IO buffer.
RE would like this to be part of 4.10-RC3 so this will be MFC-ed immediately.
Reviewed by: alc, tegge
|
128740 |
29-Apr-2004 |
bmilekic |
Revert previous change to this file because it breaks some things which compare /etc/fstab entries to results from getfsstat(). The real way to fix this is to make 'ufs2' a recognized filesystem (for real, no beating around the bush).
This should fix things like 'umount -a -t ufs' now. Apologies for the previous breakage.
|
128658 |
26-Apr-2004 |
bmilekic |
The previous change to mount(8) to report ufs or ufs2 used libufs, which only works for Charlie root.
This change reverts the introduction of libufs and moves the check into the kernel. Since the f_fstypename is the same for both ufs and ufs2, we check fs_magic for presence of ufs2 and copy "ufs2" explicitly instead.
Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
|
128006 |
07-Apr-2004 |
bde |
Record where half the bits in this file came from (from ufs_readwrite.c). Damage to history from moving bits was especially large since a repo copy is not feasible for partial files.
|
127975 |
07-Apr-2004 |
imp |
Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and irc message from Robert Watson saying that clause 3 can be removed from those files with an NAI copyright that also have only a University of California copyright.
Approved by: core, rwatson
|
127955 |
06-Apr-2004 |
jhb |
Fix a paste-o from the buf_prewrite() cleanup commit and check for the MNTK_SUSPEND flag on the correct vnode pointer in softdep_disk_prewrite().
Reviewed by: phk Tested by: kensmith
|
127818 |
03-Apr-2004 |
mux |
Fix the remaining warnings of growfs(8) on my sparc64 box with WARNS=6. I don't change the WARNS level in the Makefile because I didn't test this on other archs.
The fs.h fix was suggested by: marcel Reviewed by: md5(1)
|
127095 |
16-Mar-2004 |
kan |
Avoid doing bawrite to initialize inode block while holding cylinder group block locked. If filesystem has any active snapshots, bawrite can come back trying to allocate new snapshot data block from the same cylinder group and cause panic due to recursive lock attempt.
PR: 64206 Reviewed by: mckusick Tested by: pjd
|
126858 |
11-Mar-2004 |
phk |
When I was a kid my work table was one cluttered mess and cleaning it up was a rather overwhelming task. I soon learned that if you don't know where you're going to store something, at least try to pile it next to something slightly related in the hope that a pattern emerges.
Apply the same principle to the ffs/snapshot/softupdates code which has leaked into specfs: Add yet another buf-quasi-method and call it from the only two places I can see it can make a difference and implement the magic in ffs_softdep.c where it belongs.
It's not pretty, but at least it's one less layer violated.
|
126853 |
11-Mar-2004 |
phk |
Properly vector all bwrite() and BUF_WRITE() calls through the same path and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().
|
126154 |
23-Feb-2004 |
mckusick |
In the function clear_inodedeps(), a FREE_LOCK() should be called AFTER the call to vn_start_write(), not before it. Otherwise, it is possible to unlock it multiple times if the vn_start_write() fails.
Submitted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
|
125796 |
14-Feb-2004 |
bde |
Fixed some style bugs:
- don't unlock the vnode after vinvalbuf() only to have to relock it almost immediately.
- don't refer to devices classified by vn_isdisk() as block devices.
|
125765 |
13-Feb-2004 |
bde |
MFextfs: backed out secondary changes in rev.1.40 that had become just style bugs (a variable that is used only once, and misformattings).
|
125764 |
13-Feb-2004 |
kuriyama |
Fix style bugs in previous commit.
Submitted by: bde
|
125738 |
12-Feb-2004 |
bde |
Fixed some minor style bugs (English usage and formatting of binary operators) in and near revs.1.169-1.170 (open mode bandaid). This (or better a proper fix) should have been done before cloning the bandaid to many other file systems.
|
125732 |
12-Feb-2004 |
kuriyama |
Reverse lock order by using local variable. This will shut up "acquiring duplicate lock of same type" message.
Reviewed by: mckusick
|
125710 |
11-Feb-2004 |
bde |
Removed more vestiges of vfs_ioopt:
- rev.1.42 of ffs_readwrite.c added a special case in ffs_read() for reads that are initially at EOF, and rev.1.62 of ufs_readwrite.c fixed timestamp bugs in it. Removal of most of vfs_ioopt made it just an optimization, and removal of the vm object reference calls made it less than an optimization. It was cloned in rev.1.94 of ufs_readwrite.c as part of cloning ffs_extwrite() although it was always less than an optimization in ffs_extwrite().
- some comments, compound statements and vertical whitespace were vestiges of dead code.
|
125454 |
04-Feb-2004 |
jhb |
Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit structure is treated similarly to struct ucred in that it is always copy on write, so having a reference to a structure is sufficient to read from it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading limits from a process to keep the limit structure from changing out from under you while reading from it.
- Various global limits that are ints are not protected by a lock since int writes are atomic on all the archs we support and thus a lock wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return either an rlimit, or the current or max individual limit of the specified resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit() (it didn't use the stackgap when it should have) but uses lim_rlimit() and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls, but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It also no longer uses the stackgap for accessing sysctl's for the ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result, ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups) Tested on: i386 Compiled on: alpha, amd64
|
125259 |
31-Jan-2004 |
alc |
Remove unnecessary vm object reference and deallocate calls from ffs_read() and ffs_write(). These calls trace their origins to the dead vfs_ioopt code, first appearing in revision 1.39 of ufs_readwrite.c.
Observed by: bde Discussed with: tegge
|
125079 |
27-Jan-2004 |
ache |
Turn uio_resid/uio_offset comments into KASSERTs
Reviewed by: bde
|
124857 |
23-Jan-2004 |
ache |
Copy comment about caller check from ffs_read to ffs_extread, don't check for uio_resid < 0 here too.
|
124856 |
23-Jan-2004 |
ache |
Fix various panic() strings to reflect true function name to allow easy grep. Small code reorganization to look more logical. Copy ffs_write check from prev. commit to ffs_extwrite.
|
124855 |
23-Jan-2004 |
ache |
ffs_read: Replace wrong check returned EFBIG with EOVERFLOW handling from POSIX:
36708 [EOVERFLOW] The file is a regular file, nbyte is greater than 0, the starting position is before the end-of-file, and the starting position is greater than or equal to the offset maximum established in the open file description associated with fildes.
ffs_write: Replace u_int64_t cast with uoff_t cast which is more natural for types used.
ffs_write & ffs_read: Remove uio_offset and uio_resid checks for negative values, the caller is supposed to do it already. Add comments about it.
Reviewed by: bde
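The POSIX condition quoted above translates fairly directly into a predicate. The following is a sketch only: demo_read_check() and its parameters are hypothetical, max_file_size stands in for the offset maximum of the open file description, and the actual ffs_read() code is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return EOVERFLOW when a regular file is read with nbyte > 0 from an
 * offset that is before end-of-file but at or beyond the offset maximum.
 * Otherwise return 0.
 */
static int
demo_read_check(int is_regular, int64_t nbyte, int64_t offset,
    int64_t file_size, int64_t max_file_size)
{
	/* Negative values are assumed to be rejected by the caller. */
	assert(nbyte >= 0 && offset >= 0);
	if (nbyte == 0)
		return (0);
	if (is_regular && offset < file_size && offset >= max_file_size)
		return (EOVERFLOW);
	return (0);
}
```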
|
124728 |
19-Jan-2004 |
kan |
Spell magic '16' number as IO_SEQSHIFT.
|
124119 |
04-Jan-2004 |
kan |
Avoid calling vprint on a vnode while holding its interlock mutex. Move diagnostic printf after vget. This might delay the debug output some, but at least it keeps kernel from exploding if DEBUG_VFS_LOCKS is in effect.
|
123217 |
07-Dec-2003 |
truckman |
Set fs_ronly to the correct value in ffs_reload() when reloading the file system super block after fsck has repaired the file system. The value of fs_ronly was getting overwritten, which caused ffs_update() to attempt to update inode timestamps even though the file system was still mounted read-only.
This fixes the "giving up on N buffers" error that is triggered by running fsck on the root file system and then rebooting without mounting the file system read-write.
|
122783 |
16-Nov-2003 |
wes |
Write the UFS2 superblock with a 'BAD' magic number at the beginning of newfs, to signify the newfs operation has not yet completed. Rewrite the superblock with the correct magic number once all of the cylinder groups have been created to show the operation has finished.
Sponsored by: St. Bernard Software
|
122747 |
15-Nov-2003 |
phk |
Send B_PHYS out to pasture, it no longer serves any function.
|
122596 |
13-Nov-2003 |
alc |
Call free(9) after the vnode interlock is released, avoiding a lock-order reversal.
|
122537 |
12-Nov-2003 |
mckusick |
Update the statfs structure with 64-bit fields to allow accurate reporting of multi-terabyte filesystem sizes.
You should build and boot a new kernel BEFORE doing a `make world' as the new kernel will know about binaries using the old statfs structure, but an old kernel will not know about the new system calls that support the new statfs structure. Running an old kernel after a `make world' will cause programs such as `df' that do a statfs system call to fail with a bad system call.
Reviewed by: Bruce Evans <bde@zeta.org.au> Reviewed by: Tim Robbins <tjr@freebsd.org> Reviewed by: Julian Elischer <julian@elischer.org> Reviewed by: the hordes of <arch@freebsd.org> Sponsored by: DARPA & NAI Labs.
|
122091 |
05-Nov-2003 |
kan |
Remove mntvnode_mtx and replace it with per-mountpoint mutex. Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to operate on this mutex transparently.
Eventually new mutex will be protecting more fields in struct mount, not only vnode list.
Discussed with: jeff
|
121925 |
03-Nov-2003 |
kan |
Use VOP_UNLOCK/vrele instead of vput. td was received as a parameter and one cannot be sure it is equal to curthread.
|
121874 |
02-Nov-2003 |
kan |
Take care not to call vput if thread used in corresponding vget wasn't curthread, i.e. when we receive a thread pointer to use as a function argument. Use VOP_UNLOCK/vrele in these cases.
The only case where td != curthread known at the moment is boot() calling sync with thread0 pointer.
This fixes the panic on shutdown people have reported.
|
121847 |
01-Nov-2003 |
kan |
Temporarily undo parts of the struct mount locking commit by jeff. It is unsafe to hold a mutex across vput/vrele calls.
This will be redone when a better locking strategy is agreed upon.
Discussed with: jeff
|
121785 |
31-Oct-2003 |
truckman |
Tweak the calculation of minbfree in ffs_dirpref() so that only those cylinder groups that have at least 75% of the average free space per cylinder group for that file system are considered as candidates for the creation of a new directory. The previous formula for minbfree would set it to zero if the file system was more than 75% full, which allowed cylinder groups with no free space at all to be chosen as candidates for directory creation, which resulted in an expensive search for free blocks for each file that was subsequently created in that directory.
Modify the calculation of minifree in the same way.
Decrease maxcontigdirs as the file system fills to decrease the likelihood that a cluster of directories will overflow the available space in a cylinder group.
Reviewed by: mckusick Tested by: kmarx@vicor.com MFC after: 2 weeks
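The effect of the new threshold can be illustrated with a small sketch. The function names are hypothetical, and both the shape of the old formula (average minus a quarter of the group size) and the clamp to a minimum of one are assumptions inferred from the description above, not the committed code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed shape of the old threshold: the average free space minus a
 * quarter of the cylinder group size, clamped at zero. Once the
 * average falls below 25% of the group size, the threshold is zero.
 */
int64_t
old_min_free(int64_t avg_free, int64_t group_size)
{
	int64_t min_free;

	assert(avg_free >= 0 && group_size > 0);
	min_free = avg_free - group_size / 4;
	return (min_free < 0 ? 0 : min_free);
}

/*
 * Threshold as described in the log: 75% of the average free space
 * per cylinder group. The lower bound of one is an assumption made
 * here so that completely full groups never qualify.
 */
int64_t
new_min_free(int64_t avg_free)
{
	int64_t min_free;

	assert(avg_free >= 0);
	min_free = avg_free - avg_free / 4;
	return (min_free < 1 ? 1 : min_free);
}

/* A cylinder group is a candidate only if it meets the threshold. */
int
is_candidate(int64_t group_free, int64_t min_free)
{
	return (group_free >= min_free);
}
```

With an average of 1000 free blocks in groups of 8192 blocks, the assumed old formula yields a threshold of zero, so a group with no free blocks still qualifies; the new formula yields 750 and rejects it.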
|
121443 |
23-Oct-2003 |
jhb |
Move the P_COWINPROGRESS flag from being a per-process p_flag to being a per-thread td_pflag which doesn't require any locks to read or write as it is only read or written by curthread on itself.
Glanced at by: mckusick
|
121354 |
22-Oct-2003 |
tegge |
Initialize bp->b_offset to the physical offset in partition so GEOM knows where to read from disk.
|
121205 |
18-Oct-2003 |
phk |
DuH!
bp->b_iooffset (the spot on the disk), not bp->b_offset (the offset in the file)
|
121202 |
18-Oct-2003 |
phk |
Initialize bp->b_offset before calling VOP_[SPEC]STRATEGY()
|
121158 |
17-Oct-2003 |
mckusick |
When expunging unlinked files from a snapshot, skip over holes in the file rather than panicking with "indiracct: botched params".
Submitted by: Mark Santcroos <marks@ripe.net>
|
120841 |
06-Oct-2003 |
jeff |
- My last commit to this file is still not safe, I believe that it may be due to the recursion in indir_trunc().
|
120839 |
06-Oct-2003 |
jeff |
- Reinstate 1.142 this was fixed by 1.144.
|
120825 |
05-Oct-2003 |
jeff |
- The VCHR case in ffs_sync() is an unnecessary optimization, especially considering how infrequently we access devices via ffs now that we have devfs. Collapse this case with the other case.
Obtained from: bde
|
120805 |
05-Oct-2003 |
jeff |
- Further simplify ffs_sync(). The vnode lock is required for UFS_UPDATE() so make the code slightly more uniform. The vnode lock is acquired in all cases and now the only difference between VCHR and other is we call UFS_UPDATE instead of VOP_FSYNC().
|
120804 |
05-Oct-2003 |
jeff |
- In ffs_update() assert that either the vnode lock or the XLOCK is held.
|
120793 |
05-Oct-2003 |
jeff |
- Check the XLOCK before inspecting v_data.
- Slightly rewrite the fsync loop to be more lock friendly. We must acquire the vnode interlock before dropping the mnt lock. We must also check XLOCK to prevent vclean() races.
- Use LK_INTERLOCK in the vget() in ffs_sync to further prevent vclean() races.
- Use a local variable to store the results of the nvp == TAILQ_NEXT test so that we do not access the vp after we've vrele()d it.
- Add an XXX comment about UFS_UPDATE() not being protected by any lock here. I suspect that it should need the VOP lock.
|
120789 |
05-Oct-2003 |
jeff |
- Skip over xvp if XLOCK is set.
|
120763 |
04-Oct-2003 |
alc |
Synchronize access to a vm page's valid field using the containing vm object's lock.
|
120750 |
04-Oct-2003 |
jeff |
- The VI assert in getdirtybuf() is only valid if we're not on a VCHR vnode. VCHR vnodes don't do background writes.
Reported by: kan
|
120741 |
04-Oct-2003 |
jeff |
- Increase the scope of the interlock in ffs_reload(). Acquire it before we release the mntvnode_mtx.
- Call vgonel() directly instead of going through vrecycle() since we own the interlock now.
- Remove a few cases where we locked the interlock just so that we could call VOP_UNLOCK with interlock held.
|
120740 |
04-Oct-2003 |
jeff |
- Fix an unlocked call to GETATTR by slightly shuffling the code in ffs_snapshot() around.
- Acquire the interlock before releasing the mntvnode_mtx. Use the interlock to protect v_usecount access.
|
120732 |
04-Oct-2003 |
jeff |
- Remove a mp_fixme() and some locks that weren't necessary. I now understand how this works.
|
119707 |
03-Sep-2003 |
jeff |
- Several of the callers to getdirtybuf() were erroneously changed to pass in a list head instead of a pointer to the first element at the time of the first call. These lists are subject to change, and getdirtybuf() would refetch from the wrong list in some cases.
Spotted by: tegge Pointy hat to: me
|
119604 |
31-Aug-2003 |
jeff |
- Backout rev 1.142. This caused a deadlock that I do not understand. More investigation is required.
|
119603 |
31-Aug-2003 |
jeff |
- Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to bail out if the buffer is not already present.
- The buffer returned by incore() is not locked and should not be sent to brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the desired semantics.
|
119601 |
31-Aug-2003 |
jeff |
- Don't acquire the vnode interlock in drain_output(). Instead, require the caller to acquire it. This permits drain_output() to be done atomically with other operations as well as reducing the number of lock operations.
- Assert that the proper locks are held in drain_output().
- Change getdirtybuf() to accept a mutex as an argument. This mutex is used to protect the vnode's buf list and the BKGRDWAIT flag. This lock is dropped when we successfully acquire a buffer and held on return otherwise. These semantics reduce the number of cumbersome cases in calling code.
- Pass the mtx from getdirtybuf() into interlocked_sleep() and allow this mutex to be used as the interlock argument to BUF_LOCK() in the LOCKBUF case of interlocked_sleep().
- Change the return value of getdirtybuf() to be the resulting locked buffer or NULL otherwise. This is for callers who pass in a list head that requires a lock. It is necessary since the lock that protects the list head must be dropped in getdirtybuf() so that we don't have a lock order reversal with the buf queues lock in bremfree().
- Adjust all callers of getdirtybuf() to match the new semantics.
- Add a comment in indir_trunc() that points at unlocked access to a buf. This may also be one of the last instances of incore() in the tree.
|
119521 |
28-Aug-2003 |
jeff |
- Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field.
- Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode interlock.
- Don't use the B_LOCKED flag and QUEUE_LOCKED for background write buffers. Check for the BKGRDINPROG flag before recycling or throwing away a buffer. We do this instead because it is not safe for us to move the original buffer to a new queue from the callback on the background write buffer.
- Remove the B_LOCKED flag and the locked buffer queue. They are no longer used.
- The vnode interlock is used around checks for BKGRDINPROG where it may not be strictly necessary. If we hold the buf lock then a background write will not be started without our knowledge; one may only be completed while we're not looking. Rather than remove the code, document two of the places where this extra locking is done. A pass should be done to verify and minimize the locking later.
|
119088 |
18-Aug-2003 |
alc |
The previous change necessitates the addition of a new #include. Otherwise, there is a compilation warning.
|
119049 |
17-Aug-2003 |
phk |
Don't use a VOP_*() function on our own vnodes, go directly to the relevant internal function, in this case ufs_bmaparray().
|
118986 |
16-Aug-2003 |
alc |
Revision 1.44 of ufs/ufs/inode.h has made it necessary to add two new #includes to this file. Otherwise, it doesn't compile.
|
118969 |
15-Aug-2003 |
phk |
Eliminate the i_devvp field from the incore UFS inodes, we can get the same value from ip->i_ump->um_devvp.
This saves a pointer in the memory copies of inodes, which can easily run into several hundred kilobytes.
The extra indirection is unmeasurable in benchmarks.
Approved by: mckusick
|
118607 |
07-Aug-2003 |
jhb |
Consistently use the BSD u_int and u_short instead of the SYSV uint and ushort. In most of these files, there was a mixture of both styles and this change just makes them self-consistent.
Requested by: bde (kern_ktrace.c)
|
118131 |
28-Jul-2003 |
rwatson |
Rename VOP_RMEXTATTR() to VOP_DELETEEXTATTR() for consistency with the kernel ACL interfaces and system call names.
Break out UFS2 and FFS extattr delete and list vnode operations from setextattr and getextattr to deleteextattr and listextattr, which cleans up the implementations, and makes the results more readable, and makes the APIs more clear.
Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
118047 |
26-Jul-2003 |
phk |
Add a "int fd" argument to VOP_OPEN() which in the future will contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.
|
116423 |
15-Jun-2003 |
alc |
Lock the vm object when freeing pages.
|
116412 |
15-Jun-2003 |
phk |
Add the same KASSERT to all VOP_STRATEGY and VOP_SPECSTRATEGY implementations to check that the buffer points to the correct vnode.
|
116271 |
12-Jun-2003 |
phk |
Initialize struct vfsops C99-sparsely.
Submitted by: hmp Reviewed by: phk
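"C99-sparsely" refers to C99 designated initializers: only the members that are named receive a value, and the rest are zero-initialized. A minimal sketch of the idiom follows; the structure and function names are hypothetical and much reduced, not the real struct vfsops:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced operations table (not the real struct vfsops). */
struct demo_vfsops {
	int (*vfs_mount)(void);
	int (*vfs_unmount)(void);
	int (*vfs_sync)(void);
};

int demo_mount(void) { return (0); }
int demo_sync(void) { return (0); }

/*
 * Sparse initialization: members are named explicitly, so their order
 * in the structure does not matter, and every member left out
 * (vfs_unmount here) is implicitly initialized to NULL.
 */
struct demo_vfsops demo_ops = {
	.vfs_mount = demo_mount,
	.vfs_sync = demo_sync,
};

/* Returns nonzero when the table provides the optional unmount operation. */
int
demo_has_unmount(const struct demo_vfsops *ops)
{
	assert(ops != NULL);
	return (ops->vfs_unmount != NULL);
}
```

The practical benefit is that adding or reordering members in the structure does not silently shift positional initializers in every filesystem that defines such a table.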
|
116192 |
11-Jun-2003 |
obrien |
Use __FBSDID().
|
115869 |
05-Jun-2003 |
rwatson |
Implement ffs_listextattr() by breaking out that logic and special-cased attribute name of "" from ffs_getextattr(). Invoking VOP_GETEXTATTR() with an empty name is no longer supported; user application compatibility is provided by a system call level compatibility wrapper. We make sure to explicitly reject attempts to set an EA with the name "".
Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
115588 |
01-Jun-2003 |
rwatson |
Return EOPNOTSUPP for attempted EA operations on VCHR vnodes in UFS2; if we permit them to occur, the kernel panics due to our performing EA operations using VOP_STRATEGY on the vnode. This went unnoticed previously because there are very few users of device nodes on UFS2 due to the introduction of devfs. However, this can come up with the Linux compat directories and its hard-coded dev nodes (which will need to go away as we move away from hard-coded device numbers). This can come up if you use EA-intensive features such as ACLs and MAC.
The proper fix is pretty complicated, but this band-aid would be an excellent MFC candidate for the release.
|
115474 |
31-May-2003 |
phk |
Remove unused local variables.
Found by: FlexeLint
|
115456 |
31-May-2003 |
phk |
The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent deadlocks with vnode backed md(4) devices because md now uses a kthread to run the bio requests instead of doing it directly from the bio down path.
|
115145 |
18-May-2003 |
alc |
Lock the vm object when performing vm_object_page_clean().
Approved by: re (rwatson)
|
114599 |
03-May-2003 |
alc |
Lock the vm_object on entry to vm_object_vndeallocate().
|
114396 |
01-May-2003 |
tjr |
Do not attempt to free NULL dinodes (i_din1 or i_din2) in ffs_ifree(). These fields can be left as NULL if ffs_vget() allocates an inode but fails before the dinode memory has been allocated. There are two cases when this can occur: when we lose a race and another process has added the inode to the hash, and when reading the inode off disk fails.
The bug was observed by Kris on one of the package-building machines. See http://marc.theaimsgroup.com/?l=freebsd-current&m=105172731013411&w=2 In Kris's case, it was the bread() that failed because of a disk error.
The alternative to this patch is to ensure that ffs_vget() does not call vput() when the inode hasn't been properly initialised.
|
114395 |
01-May-2003 |
tjr |
Free i_din2 instead of i_din1 in ffs_ifree() on UFS2 filesystems. This is purely a cosmetic change because these members are in a union together.
|
114293 |
30-Apr-2003 |
markm |
Fix some easy, global, lint warnings. In most cases, this means making some local variables static. In a couple of cases, this means removing an unused variable.
|
114216 |
29-Apr-2003 |
kan |
Deprecate machine/limits.h in favor of new sys/limits.h. Change all in-tree consumers to include <sys/limits.h>
Discussed on: standards@ Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
|
113872 |
22-Apr-2003 |
jhb |
Lock both the proc lock and sched_lock when calling sched_nice since kg_nice is now protected by both. Being protected by both means that other places in the kernel that want to read kg_nice only need one of the two locks.
|
113376 |
12-Apr-2003 |
jeff |
- Use the sched_nice() api instead of setting the nice value directly.
Tested by: Steve Kargl <sgk@troutmask.apl.washington.edu>
|
113175 |
06-Apr-2003 |
alc |
Sufficient access checks are performed by vmapbuf() that calling useracc() is pointless. Remove the call to useracc().
Don't reinitialize fields that are already initialized by getpbuf().
Reviewed by: tegge
|
112724 |
27-Mar-2003 |
tegge |
Check return value from vmapbuf instead of the function address.
|
112718 |
27-Mar-2003 |
tegge |
Eliminate a buffer sleep/wakeup race.
|
112694 |
26-Mar-2003 |
tegge |
Add support for reading directly from file to userland buffer when the O_DIRECT descriptor status flag is set and both offset and length are multiples of the physical media sector size.
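The alignment condition for the direct path can be sketched as a single predicate. The function name and signature are hypothetical; the committed code may express the test differently:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns nonzero when a read qualifies for the direct path described
 * above: both the file offset and the transfer length are multiples
 * of the sector size. Unaligned requests would take the ordinary
 * buffered path instead.
 */
int
direct_read_ok(uint64_t offset, uint64_t length, uint32_t sector_size)
{
	assert(sector_size != 0);
	return (offset % sector_size == 0 && length % sector_size == 0);
}
```

For 512-byte sectors, a read of 8192 bytes at offset 4096 qualifies, while a read of 100 bytes at offset 512 does not.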
|
112451 |
20-Mar-2003 |
jhb |
Use td->td_ucred instead of td->td_proc->p_ucred.
|
112450 |
20-Mar-2003 |
jhb |
Minor fixes to ffs_fserr():
- Assume that curthread is not NULL. It never is in -current.
- Use td_ucred instead of p_ucred.
|
112367 |
18-Mar-2003 |
phk |
Including <sys/stdint.h> is (almost?) universally only to be able to use %j in printfs, so put a nested include in <sys/systm.h> where the printf prototype lives and save everybody else the trouble.
|
112181 |
13-Mar-2003 |
jeff |
- Remove a race between fsync like functions and flushbufqueues() by requiring locked bufs in vfs_bio_awrite(). Previously the buf could have been written out by fsync before we acquired the buf lock if it weren't for giant. The cluster_wbuild() handles this race properly but the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find any bufs without deps.
|
111972 |
07-Mar-2003 |
mckusick |
Use the appropriate size when zeroing out the unused portion of a snapshot's copy of a superblock. This patch fixes a panic when taking a snapshot of a 4096/512 filesystem.
Reported by: Ian Freislich <ianf@za.uu.net> Sponsored by: DARPA & NAI Labs.
|
111937 |
06-Mar-2003 |
alc |
Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.
Discussed on: arch@
|
111856 |
04-Mar-2003 |
jeff |
- Add a new 'flags' parameter to getblk().
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT flag to the initial BUF_LOCK(). This will eventually be used in cases where we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.
Reviewed by: arch Not objected to by: mckusick
|
111748 |
02-Mar-2003 |
des |
More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).
|
111510 |
25-Feb-2003 |
mckusick |
Change the field used to test whether the superblock has been updated from the filesystem size field to the filesystem maximum blocksize field. The problem is that older versions of growfs updated only the new size field and not the old size field. This resulted in the old (smaller) size field being copied up to the new size field which caused the filesystem to appear to fsck to be badly trashed.
This also adds a sanity check to ensure that the superblock is not being updated when the filesystem is mounted read-only. Obviously such an update should never happen.
Reported by: Nate Lawson <nate@root.org> Sponsored by: DARPA & NAI Labs.
|
111463 |
25-Feb-2003 |
jeff |
- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues. This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
|
111420 |
24-Feb-2003 |
mckusick |
When removing the last item from a non-empty worklist, the worklist tail pointer must be updated.
Reported by: Kris Kennaway <kris@obsecurity.org> Sponsored by: DARPA & NAI Labs.
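The class of bug fixed here can be shown with a generic list that keeps a separate tail pointer for O(1) appends. This is a sketch with hypothetical names, not the soft updates worklist code itself:

```c
#include <assert.h>
#include <stddef.h>

struct item {
	struct item *next;
	int id;
};

struct worklist {
	struct item *head;
	struct item *tail;	/* last item, or NULL when the list is empty */
};

/* Append at the tail in constant time. */
void
wl_append(struct worklist *wl, struct item *it)
{
	assert(wl != NULL && it != NULL);
	it->next = NULL;
	if (wl->tail == NULL)
		wl->head = it;
	else
		wl->tail->next = it;
	wl->tail = it;
}

/* Unlink 'it' from the list; returns 1 if it was found, 0 otherwise. */
int
wl_remove(struct worklist *wl, struct item *it)
{
	struct item *prev = NULL, *cur;

	assert(wl != NULL && it != NULL);
	for (cur = wl->head; cur != NULL; prev = cur, cur = cur->next) {
		if (cur != it)
			continue;
		if (prev == NULL)
			wl->head = cur->next;
		else
			prev->next = cur->next;
		/* Removing the last item: the tail must move back as well. */
		if (wl->tail == cur)
			wl->tail = prev;
		cur->next = NULL;
		return (1);
	}
	return (0);
}
```

Without the tail update in wl_remove(), a later wl_append() would link the new item behind an element that is no longer on the list, and the item would be lost.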
|
111240 |
22-Feb-2003 |
mckusick |
This patch fixes a deadlock between the bufdaemon and a process taking a snapshot. As part of taking a snapshot of a filesystem, the kernel builds up a list of the filesystem metadata (such as the cylinder group bitmaps) that are contained in the snapshot. When doing a copy-on-write check, the list is first consulted. If the block being written is found on the list, then the full snapshot lookup can be avoided. Besides providing an important performance speedup this check also avoids a potential deadlock between the code creating the snapshot and the bufdaemon trying to cleanup snapshot related buffers. This fix creates a temporary list containing the key metadata blocks that can cause the deadlock. This temporary list is used between the time that the snapshot is first enabled and the time that the fully complete list is built.
Reported by: Attila Nagy <bra@fsn.hu> Sponsored by: DARPA & NAI Labs.
|
111239 |
22-Feb-2003 |
mckusick |
This patch fixes a bug that caused an active filesystem on which a snapshot is being taken to panic with either "freeing free block" or "freeing free inode". The problem arises when the snapshot code is scanning the filesystem looking for inodes with a reference count of zero (e.g., unlinked but still open) so that it can expunge them from its view. If it encounters a reclaimed vnode and has to restart its scan, then it will panic if it encounters and tries to free an inode that it has already processed. The fix is to check each candidate inode to see if it has already been processed before trying to delete it from the snapshot image.
Sponsored by: DARPA & NAI Labs.
|
111238 |
22-Feb-2003 |
mckusick |
This patch fixes a bug in the logical block calculation macros so that they convert to 64-bit values before shifting rather than afterwards. Once fixed, they can be used rather than inline expanded.
Sponsored by: DARPA & NAI Labs.
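The difference between shifting before and after widening can be sketched as follows. The function names are hypothetical, and unsigned types are used here (an assumption for the sake of the example) so that the 32-bit wraparound is well defined in C:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Shift in 32 bits and widen afterwards: any bits shifted past bit 31
 * are already lost by the time the value becomes 64 bits wide.
 */
uint64_t
lblk_to_size_narrow(uint32_t lblk, unsigned bshift)
{
	assert(bshift < 32);
	return ((uint64_t)(uint32_t)(lblk << bshift));
}

/* Widen to 64 bits first, then shift: no bits are lost. */
uint64_t
lblk_to_size_wide(uint32_t lblk, unsigned bshift)
{
	assert(bshift < 32);
	return ((uint64_t)lblk << bshift);
}
```

For logical block 2^22 with a block shift of 14 (16K blocks), the byte offset is 2^36. The narrow version wraps to zero, while the wide version returns the correct value; for small block numbers the two agree, which is why this kind of bug only shows up on large filesystems.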
|
111119 |
19-Feb-2003 |
imp |
Back out M_* changes, per decision of the TRB.
Approved by: trb
|
110885 |
14-Feb-2003 |
mckusick |
Replace use of random() with arc4random() to provide less guessable values for the initial inode generation numbers in newfs and for newly allocated inode generation numbers in the kernel.
Submitted by: Theo de Raadt <deraadt@cvs.openbsd.org> Sponsored by: DARPA & NAI Labs.
|
110837 |
14-Feb-2003 |
mckusick |
Correct lines incorrectly added to the copyright message.
Submitted by: Frank van der Linden <fvdl@wasabisystems.com> Sponsored by: DARPA & NAI Labs.
|
110584 |
09-Feb-2003 |
jeff |
- Cleanup unlocked accesses to buf flags by introducing a new b_vflag member that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some fs specific processing. This gives all of these filesystems proper behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h
|
109623 |
21-Jan-2003 |
alfred |
Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
|
109153 |
13-Jan-2003 |
dillon |
Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
|
109123 |
12-Jan-2003 |
dillon |
Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it.
Change struct xfile xf_data to xun_data (ABI is still compatible).
If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
|
109053 |
10-Jan-2003 |
marcel |
o Improve wording of the comment that accompanies fs_pad. The padding is not specific to non-i386 architectures. It is caused by non-i386 specific alignment requirements of fs_swuid.
o Add a CTASSERT to catch a change in the size of struct fs at compile-time rather than run-time.
Ok'd: gordon Tested on: i386 ia64
|
109034 |
09-Jan-2003 |
gordon |
Fix superblock alignment problems on non-i386 platforms. Also change fs_uuid to fs_swuid, making it more descriptive.
Submitted by: marcel Reviewed by: peter Pointy hat to: gordon
|
108970 |
08-Jan-2003 |
gordon |
Steal some space from fs_fsmnt to create fs_volname and fs_uuid. The volname will be used to support volume names with the help of a GEOM module (to be committed). uuid will be used to deal with conflicting volume names (which doesn't work just yet).
Approved by: mckusick@
|
108892 |
07-Jan-2003 |
mckusick |
This patch fixes a problem caused by applications that rapidly and repeatedly truncate the same file. Each time the file is truncated, a buffer is grabbed to store the indirect block numbers that need to be freed. Those blocks cannot be freed until the inode claiming them is written to disk. Thus, the number of buffers being held by soft updates explodes and in extreme cases can run the kernel out of buffers. The problem can be avoided by doing an fsync on the file every debug.maxindirdep truncates (currently defaulted to 50). The fsync causes the inode to be written so that the held buffers can be freed. The check for excessive buffers is checked as part of the existing hook for excessive dependencies (softdep_slowdown) in the truncate code.
Reported by: David Schultz <dschultz@uclink.Berkeley.EDU> Sponsored by: DARPA & NAI Labs. MFC after: 3 weeks
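The throttling idea can be sketched as a simple counter. This is a simplification with hypothetical names: the committed code hooks into the existing softdep_slowdown check rather than using a structure like this, and only the default of 50 is taken from the log:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAXINDIRDEP 50	/* default quoted above for debug.maxindirdep */

/* Hypothetical bookkeeping: truncates since the last forced fsync. */
struct trunc_state {
	int pending;
};

/*
 * Record one truncate. Returns 1 when the caller should fsync the
 * file so that the inode is written and the held buffers can be
 * freed, 0 otherwise. The counter restarts after each forced fsync.
 */
int
note_truncate(struct trunc_state *ts)
{
	assert(ts != NULL);
	if (++ts->pending < DEMO_MAXINDIRDEP)
		return (0);
	ts->pending = 0;
	return (1);
}
```

The point of the bound is that the number of buffers pinned by pending truncates can no longer grow without limit: at most a fixed number accumulate before a write releases them.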
|
108589 |
03-Jan-2003 |
phk |
Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since all BUF_STRATEGY did in the first place was call VOP_STRATEGY.
|
108533 |
01-Jan-2003 |
schweikh |
Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup, especially in troff files.
|
108524 |
01-Jan-2003 |
alfred |
When compiling the kernel do not implicitly include filedesc.h from proc.h, this was causing filedesc work to be very painful. In order to make this work split out sigio definitions to their own header (sigio.h) which is included from proc.h for the time being.
|
108316 |
27-Dec-2002 |
phk |
Use three UMA zones for FFS/UFS inodes instead of malloc space. Since inodes are currently 144 bytes, this will save 112 bytes per inode. This can amount to up to 10MByte on large systems.
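The 112-byte figure is consistent with a general-purpose allocator that rounds each request up to a power-of-two bucket, which is the assumption made in this sketch (function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest power of two that is >= n: the bucket a power-of-two
 * allocator (assumed here) would use for an n-byte request.
 */
size_t
pow2_bucket(size_t n)
{
	size_t b = 1;

	/* Upper bound keeps the loop below from overflowing. */
	assert(n > 0 && n <= (((size_t)-1) >> 1) + 1);
	while (b < n)
		b <<= 1;
	return (b);
}

/* Bytes wasted per object when a request is rounded up to its bucket. */
size_t
bucket_waste(size_t n)
{
	return (pow2_bucket(n) - n);
}
```

A 144-byte inode lands in a 256-byte bucket, wasting 256 - 144 = 112 bytes, whereas a dedicated zone sized for the object avoids that rounding. At 112 bytes per inode, roughly 90,000 cached inodes account for the 10 MByte mentioned above.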
|
108315 |
27-Dec-2002 |
phk |
Move the allocation of the inode contents into ffs_vfsops.c rather than passing malloc types around.
|
108313 |
27-Dec-2002 |
phk |
Make ffs_mountfs() static.
Remove the malloctype from the ufs mount structure, instead add a callback to the storage method for freeing inodes: UFS_IFREE().
Add vfs_ifree() method function which frees an inode.
Unvariablelize the malloc type used for allocating inodes.
|
108050 |
18-Dec-2002 |
mckusick |
Fix corruption introduced in previous delta.
Reported by: Aurelien Nephtali <aurelien.nephtali@wanadoo.fr> Sponsored by: DARPA & NAI Labs.
|
108017 |
18-Dec-2002 |
mckusick |
Keep comments consistent with the code. Minor optimization.
Sponsored by: DARPA & NAI Labs.
|
108010 |
18-Dec-2002 |
mckusick |
Cosmetic cleanup of unsigned buglets.
Submitted by: Bruce Evans <bde@zeta.org.au> Sponsored by: DARPA & NAI Labs.
|
107992 |
17-Dec-2002 |
phk |
Remove unused lockcnt variable.
Approved by: mckusick
|
107915 |
15-Dec-2002 |
mckusick |
Update to previous change (1.54) to use an appropriately wide inode field so as to work correctly on 64-bit platforms.
Reported-by: Jake Burkholder <jake@locore.ca> Sponsored by: DARPA & NAI Labs. Approved by: Ian Dowse <iedowse@maths.tcd.ie>
|
107848 |
14-Dec-2002 |
mckusick |
Only the most recent snapshot contains the complete list of blocks that were copied in all of the earlier snapshots, thus its precomputed list must be used in the copyonwrite test. Using incomplete lists may lead to deadlock. Also do not include the blocks used for the indirect pointers in the indirect pointers as this may lead to inconsistent snapshots.
Sponsored by: DARPA & NAI Labs. Approved by: re
|
107762 |
12-Dec-2002 |
trhodes |
Remove the comment about dump(8) not working properly with snapshots.
Discussed with: mckusick Approved by: re (rwatson)
|
107651 |
06-Dec-2002 |
mckusick |
More tightly verify the preference returned for the new inode.
Submitted by: Kris Kennaway <kris@obsecurity.org> Sponsored by: DARPA & NAI Labs. Approved by: re
|
107558 |
03-Dec-2002 |
mckusick |
Have to use bread() rather than UFS_BALLOC() when obtaining a previously allocated block as the previous use of the block may have fallen out of the cache. Failure to reread its contents causes zeroed results to be written instead of the proper contents. Conversely, when the block is going to be entirely filled in, it is not necessary to reread the old contents.
Sponsored by: DARPA & NAI Labs. Approved by: re
|
107415 |
30-Nov-2002 |
mckusick |
Add a check to disable the previous patch so that future filesystems that choose to place their superblocks in non-standard locations will not get them smashed.
Sponsored by: DARPA & NAI Labs.
|
107414 |
30-Nov-2002 |
mckusick |
Remove a race condition / deadlock from snapshots. When converting from individual vnode locks to the snapshot lock, be sure to pass any waiting processes along to the new lock as well. This transfer is done by a new function in the lock manager, transferlockers(from_lock, to_lock); Thanks to Lamont Granquist <lamont@scriptkiddie.org> for his help in pounding on snapshots beyond all reason and finding this deadlock.
Sponsored by: DARPA & NAI Labs.
|
107406 |
30-Nov-2002 |
mckusick |
Fix two deadlocks in snapshots:
1) Release the snapshot file lock while suspending the system. Otherwise a process trying to read the lock may block on its containing directory preventing the suspension from completing. Thanks to Sean Kelly <smkelly@zombie.org> for finding this deadlock.
2) Replace some bdwrite's with bawrite's so as not to fill all the buffers with dirty data. The buffers could not be cleaned as the snapshot vnode was locked hence the system could deadlock when making snapshots of really massive filesystems. Thanks to Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp> for figuring this out.
Sponsored by: DARPA & NAI Labs.
|
107393 |
29-Nov-2002 |
mckusick |
Check to make sure that the fs_sblockloc field was properly updated before using it to write the superblock. This is to guard against accidentally trashing the disklabel if the superblock format missed being upgraded by the new kernel.
Reported by: Sam Leffler <sam@errno.com> Sponsored by: DARPA & NAI Labs. Approved by: Murray Stokely <murray@FreeBSD.org>
|
107294 |
27-Nov-2002 |
mckusick |
Create a new 32-bit fs_flags word in the superblock. Add code to move the old 8-bit fs_old_flags to the new location the first time that the filesystem is mounted by a new kernel. One of the unused flags in fs_old_flags is used to indicate that the flags have been moved. Leave the fs_old_flags word intact so that it will work properly if used on an old kernel.
Change the fs_sblockloc superblock location field to be in units of bytes instead of in units of filesystem fragments. The old units did not work properly when the fragment size exceeded the superblock size (8192). Update old fs_sblockloc values at the same time that the flags are moved.
Suggested by: BOUWSMA Barry <freebsd-misuser@netscum.dyndns.dk> Sponsored by: DARPA & NAI Labs.
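The one-time migration described above can be sketched as follows. Everything here is hypothetical (structure layout, field names, the marker bit value, and the choice to take the byte offset from the caller); it illustrates only the pattern of using a spare bit in the old field to record that the move has happened:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FLAGS_UPDATED 0x80	/* hypothetical spare bit in the old flags */

/* Hypothetical, reduced superblock; not the real struct fs. */
struct demo_sb {
	uint8_t old_flags;	/* legacy 8-bit flags, kept for old kernels */
	uint32_t flags;		/* new 32-bit flags word */
	int64_t sblockloc;	/* superblock location */
};

/*
 * Migrate on first mount by a new kernel. 'found_at' is the byte
 * offset at which the superblock was actually read (an assumption of
 * this sketch). Returns 1 if the migration ran, 0 if the marker bit
 * shows it was already done.
 */
int
demo_migrate(struct demo_sb *sb, int64_t found_at)
{
	assert(sb != NULL && found_at >= 0);
	if (sb->old_flags & DEMO_FLAGS_UPDATED)
		return (0);
	sb->flags = sb->old_flags;		/* copy flags to the new word */
	sb->sblockloc = found_at;		/* now expressed in bytes */
	sb->old_flags |= DEMO_FLAGS_UPDATED;	/* remember the move */
	return (1);
}
```

Because the old field keeps its original bits and only gains the marker, the filesystem remains usable on a kernel that knows nothing about the new word.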
|
107096 |
20-Nov-2002 |
mckusick |
The target for the maximum number of dependencies has been cut in half because of reports that under heavy load the kernel could exhaust its memory pool. The limit is now (desiredvnodes * 4) rather than (desiredvnodes * 8), so it will still scale with larger systems, just not as quickly.
Sponsored by: DARPA & NAI Labs.
|
107095 |
20-Nov-2002 |
mckusick |
If an error occurs while writing a buffer, then the data will not have hit the disk and the dependencies cannot be unrolled. In this case, the system will mark the buffer as dirty again so that the write can be retried in the future. When the write succeeds or the system gives up on the buffer and marks it as invalid (B_INVAL), the dependencies will be cleared.
Sponsored by: DARPA & NAI Labs.
|
106965 |
15-Nov-2002 |
peter |
Do not assume that time_t is an int.
Approved by: re (jhb)
|
105988 |
26-Oct-2002 |
rwatson |
Slightly change the semantics of vnode labels for MAC: rather than "refreshing" the label on the vnode before use, just get the label right from inception. For single-label file systems, set the label in the generic VFS getnewvnode() code; for multi-label file systems, leave the labeling up to the file system. With UFS1/2, this means reading the extended attribute during vfs_vget() as the inode is pulled off disk, rather than hitting the extended attributes frequently during operations later, improving performance. This also corrects semantics for shared vnode locks, which were not previously present in the system. This changes the cache coherency properties WRT out-of-band access to label data, but in an acceptable form. With UFS1, there is a small race condition during automatic extended attribute start -- this is not present with UFS2, and occurs because EAs aren't available at vnode inception. We'll introduce a workaround for this shortly.
Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
105902 |
25-Oct-2002 |
mckusick |
Within ufs, the ffs_sync and ffs_fsync functions did not always check for and/or report I/O errors. The result is that a VFS_SYNC or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in the presence of a hard error writing a disk sector or in a filesystem full condition. This patch ensures that I/O errors will always be checked and returned. This patch also ensures that every call to VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes appropriate action when an error is returned.
Sponsored by: DARPA & NAI Labs.
|
105823 |
23-Oct-2002 |
mckusick |
We must be careful to avoid recursive copy-on-write faults when trying to clean up during disk-full scenarios.
Sponsored by: DARPA & NAI Labs.
|
105763 |
23-Oct-2002 |
mckusick |
Misplaced FREE_LOCK causes a panic when hit while taking a snapshot.
Sponsored by: DARPA & NAI Labs.
|
105670 |
22-Oct-2002 |
mckusick |
This update further fine tunes the locking of snapshot vnodes in the ffs_copyonwrite routine to avoid a deadlock between the syncer daemon trying to sync out a snapshot vnode and the bufdaemon trying to write out a buffer containing the snapshot inode. With any luck this will be the last snapshot race condition.
Sponsored by: DARPA & NAI Labs.
|
105669 |
22-Oct-2002 |
mckusick |
This update is a performance improvement when allocating blocks on a full filesystem. Previously, if the allocation failed, we had to fsync the file before rolling back any partial allocation of indirect blocks. Most block allocation requests only need to allocate a single data block and if that allocation fails, there is nothing to unroll. So, before doing the fsync, we check to see if any rollback will really be necessary. If none is necessary, then we simply return. This update eliminates the flurry of disk activity that got triggered whenever a filesystem would run out of space.
Sponsored by: DARPA & NAI Labs.
|
105667 |
22-Oct-2002 |
mckusick |
This checkin reimplements the io-request priority hack in a way that works in the new threaded kernel. It was commented out of the disksort routine earlier this year for the reasons given in kern/subr_disklabel.c (which is where this code used to reside before it moved to kern/subr_disk.c):
---------------------------- revision 1.65 date: 2002/04/22 06:53:20; author: phk; state: Exp; lines: +5 -0 Comment out Kirks io-request priority hack until we can do this in a civilized way which doesn't cause grief.
The problem is that it is not generally safe to cast a "struct bio *" to a "struct buf *". Things like ccd, vinum, ata-raid and GEOM constructs bio's which are not entrails of a struct buf.
Also, curthread may or may not have anything to do with the I/O request at hand.
The correct solution can either be to tag struct bio's with a priority derived from the requesting threads nice and have disksort act on this field, this wouldn't address the "silly-seek syndrome" where two equal processes bang the diskheads from one edge to the other of the disk repeatedly.
Alternatively, and probably better: a sleep should be introduced either at the time the I/O is requested or at the time it is completed where we can be sure to sleep in the right thread.
The sleep also needs to be in constant timeunits, 1/hz can be practically any sub-second size, at high HZ the current code practically doesn't do anything. ----------------------------
As suggested in this comment, it is no longer located in the disk sort routine, but rather now resides in spec_strategy where the disk operations are being queued by the thread that is associated with the process that is really requesting the I/O. At that point, the disk queues are not visible, so the I/O for positively niced processes is always slowed down whether or not there is other activity on the disk.
On the issue of scaling HZ, I believe that the current scheme is better than using a fixed quantum of time. As machines and I/O subsystems get faster, the resolution on the clock also rises. So, ten years from now we will be slowing things down for shorter periods of time, but the proportional effect on the system will be about the same as it is today. So, I view this as a feature rather than a drawback. Hence this patch sticks with using HZ.
Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@critter.freebsd.dk>
|
105422 |
18-Oct-2002 |
dillon |
Fix a file-rewrite performance case for UFS[2]. When rewriting portions of a file in chunks that are less than the filesystem block size, if the data is not already cached the system will perform a read-before-write. The problem is that it does this on a block-by-block basis, breaking up the I/Os and making clustering impossible for the writes. Programs such as INN using cyclic file buffers suffer greatly. This problem is only going to get worse as we use larger and larger filesystem block sizes.
The solution is to extend the sequential heuristic so UFS[2] can perform a far larger read and readahead when dealing with this case.
(note: maximum disk write bandwidth is 27MB/sec thru filesystem)
(note: filesystem blocksize in test is 8K (1K frag))
dd if=/dev/zero of=test.dat bs=1k count=2m conv=notrunc
Before: (note half of these are reads)
      tty              da0               da1              acd0            cpu
 tin tout   KB/t tps  MB/s    KB/t tps  MB/s    KB/t tps  MB/s  us ni sy in id
   0   76  14.21 598  8.30    0.00   0  0.00    0.00   0  0.00   0  0  7  1 92
   0   76  14.09 813 11.19    0.00   0  0.00    0.00   0  0.00   0  0  9  5 86
   0   76  14.28 821 11.45    0.00   0  0.00    0.00   0  0.00   0  0  8  1 91
After: (note half of these are reads)
      tty              da0               da1              acd0            cpu
 tin tout   KB/t tps  MB/s    KB/t tps  MB/s    KB/t tps  MB/s  us ni sy in id
   0   76  63.62 434 26.99    0.00   0  0.00    0.00   0  0.00   0  0 18  1 80
   0   76  63.58 424 26.30    0.00   0  0.00    0.00   0  0.00   0  0 17  2 82
   0   76  63.82 438 27.32    0.00   0  0.00    0.00   0  0.00   1  0 19  2 79
Reviewed by: mckusick Approved by: re X-MFC after: immediately (was heavily tested in -stable for 4 months)
|
105191 |
16-Oct-2002 |
mckusick |
Change locking so that all snapshots on a particular filesystem share a common lock. This change avoids a deadlock between snapshots when separate requests cause them to deadlock checking each other for a need to copy blocks that are close enough together that they fall into the same indirect block. Although I had anticipated a slowdown from contention for the single lock, my filesystem benchmarks show no measurable change in throughput on a uniprocessor system with three active snapshots. I conjecture that this result is because every copy-on-write fault must check all the active snapshots, so the process was inherently serial already. This change removes the last of the deadlocks of which I am aware in snapshots.
Sponsored by: DARPA & NAI Labs.
|
105169 |
15-Oct-2002 |
rwatson |
If the FS_MULTILABEL flag is set in a UFS or UFS2 superblock, automatically set MNT_MULTILABEL in the mount flags.
If FS_ACLS is set in a UFS or UFS2 superblock, automatically set MNT_ACLS in the mount flags.
If either of these flags is set, but the appropriate kernel option to support the features associated with the flag isn't available, then print a warning at mount-time.
Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
105136 |
14-Oct-2002 |
mckusick |
When reading or writing the extended attributes of a special device or fifo in UFS2, the normal ufs_strategy routine needs to be used rather than the spec_strategy or fifo_strategy routine. Thus the ffsext_strategy routine is interposed in the ffs_vnops vectors for special devices and fifo's to pick off this special case. Otherwise it simply falls through to the usual spec_strategy or fifo_strategy routine.
Submitted by: Robert Watson <rwatson@FreeBSD.org> Sponsored by: DARPA & NAI Labs.
|
105112 |
14-Oct-2002 |
rwatson |
Define two new superblock file system flags:
FS_ACLS Administrative enable/disable of extended ACL support FS_MULTILABEL Administrative flag to indicate to the MAC Framework that objects in the file system are individually labeled using extended attributes.
Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories Reviewed by: (in principle) mckusick, phk
|
105077 |
14-Oct-2002 |
mckusick |
Regularize the vop_stdlock'ing protocol across all the filesystems that use it. Specifically, vop_stdlock uses the lock pointed to by vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to reference vp->v_lock. Filesystems that wish to use the default do not need to allocate a lock at the front of their node structure (as some still did) or do a lockinit. They can simply start using vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks, but still use the vop_stdlock functions (such as nullfs) can simply replace vp->v_vnlock with a pointer to the lock that they wish to have used for the vnode. Such filesystems are responsible for setting the vp->v_vnlock back to the default in their vop_reclaim routine (e.g., vp->v_vnlock = &vp->v_lock).
In theory, this set of changes cleans up the existing filesystem lock interface and should have no function change to the existing locking scheme.
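The v_vnlock indirection described above can be illustrated with a small user-space model. This is only a sketch: the `ex_` names and the one-field lock are hypothetical stand-ins, not the kernel's structures or its lockmgr API.

```c
#include <assert.h>

/* Hypothetical stand-in for a kernel lock: just a "held" bit. */
struct ex_lock {
	int held;
};

struct ex_vnode {
	struct ex_lock v_lock;		/* lock embedded in every vnode */
	struct ex_lock *v_vnlock;	/* lock the std lock routines use */
};

/* Default set up at vnode creation: v_vnlock refers to the vnode's own lock. */
static void
ex_getnewvnode(struct ex_vnode *vp)
{
	vp->v_lock.held = 0;
	vp->v_vnlock = &vp->v_lock;
}

/* The std lock routines only ever look at v_vnlock. */
static void
ex_stdlock(struct ex_vnode *vp)
{
	assert(!vp->v_vnlock->held);
	vp->v_vnlock->held = 1;
}

static void
ex_stdunlock(struct ex_vnode *vp)
{
	assert(vp->v_vnlock->held);
	vp->v_vnlock->held = 0;
}

/* A stacking filesystem can point its vnode at the lower vnode's lock. */
static void
ex_share_lock(struct ex_vnode *upper, struct ex_vnode *lower)
{
	upper->v_vnlock = lower->v_vnlock;
}

/* On reclaim, the filesystem restores the default. */
static void
ex_reclaim(struct ex_vnode *vp)
{
	vp->v_vnlock = &vp->v_lock;
}
```

With this layout, locking the upper vnode of a stacked pair also locks the lower one, and after `ex_reclaim()` the two vnodes are independent again.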
Sponsored by: DARPA & NAI Labs.
|
104716 |
09-Oct-2002 |
mux |
Fix build of 64 bit platforms.
|
104698 |
09-Oct-2002 |
mckusick |
When creating a snapshot, create a list of initially allocated blocks. Whenever doing a copy-on-write check, first look in the list of initially allocated blocks to see if it is there. If so, no further check is needed. If not, fall through and do the full check. This change eliminates one of two known deadlocks caused by snapshots. Handling the second deadlock will be the subject of another check-in. This change also reduces the cost of the copy-on-write check by speeding up the verification of frequently checked blocks.
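The fast path described above might look roughly like the following sketch. It assumes the list is kept as a sorted array and searched by binary search; the commit message does not say how the list is stored, and all `ex_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if blkno appears in the sorted array list[0..n-1], else 0.
 * Assumption: the "list of initially allocated blocks" is sorted.
 */
static int
ex_in_block_list(const int64_t *list, size_t n, int64_t blkno)
{
	size_t lo = 0, hi = n, mid;

	assert(list != NULL || n == 0);
	while (lo < hi) {
		mid = lo + (hi - lo) / 2;
		if (list[mid] == blkno)
			return (1);
		if (list[mid] < blkno)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (0);
}

/*
 * Copy-on-write pre-check: a block found in the list needs no further
 * work; anything else falls through to the full (expensive) check.
 */
static int
ex_needs_full_check(const int64_t *list, size_t n, int64_t blkno)
{
	return (!ex_in_block_list(list, n, blkno));
}
```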
Sponsored by: DARPA & NAI Labs.
|
104697 |
09-Oct-2002 |
mckusick |
The appropriate units for disk block addresses are always DEV_BSIZE, even when the underlying device has a larger sector size. Therefore, the filesystem code should not (and with this patch does not) try to use the underlying sector size when doing disk block address calculations.
This patch fixes problems in -current when using the swap-based memory-disk device (mdconfig -a -t swap ...). This bugfix is not relevant to -stable as -stable does not have the memory-disk device.
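As a rough illustration of the rule above, the conversion from a filesystem fragment number to a disk block address depends only on the fragment size and DEV_BSIZE. The helper below is a hypothetical sketch, not the kernel's fsbtodb() macro, and it assumes DEV_BSIZE is 512 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define EX_DEV_BSIZE	512	/* assumed value of DEV_BSIZE */

/*
 * Convert a fragment number to a disk block address in DEV_BSIZE units.
 * The device's sector size is deliberately not a parameter.
 */
static int64_t
ex_fsbtodb(int64_t fragno, uint32_t fsize)
{
	assert(fsize >= EX_DEV_BSIZE && fsize % EX_DEV_BSIZE == 0);
	return (fragno * (int64_t)(fsize / EX_DEV_BSIZE));
}
```

For a filesystem with 2K fragments, fragment 10 starts at disk block 40 whether the underlying device uses 512-byte or 4K sectors.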
Sponsored by: DARPA & NAI Labs.
|
104688 |
08-Oct-2002 |
jeff |
- Remove LK_INTERLOCK from the vn_lock() in ffs_snapshot().
Pointy hat to: me Found by: green
|
104346 |
02-Oct-2002 |
dd |
size_t is not a struct (fix mislabelling in a comment).
|
104104 |
28-Sep-2002 |
jmallett |
When spamming me with a printf(9), under DIAGNOSTIC, at least be nice enough to include a newline.
MFC after: 4 days Sponsored by: Bright Path Solutions
|
104094 |
28-Sep-2002 |
phk |
Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too.
Inspired by: FlexeLint warning #512
|
104051 |
27-Sep-2002 |
phk |
Use our mount-credential if we get a NOCRED when we try to write out EA space back to disk.
This is wrong in many ways, but not as wrong as a panic.
Panicked on: rwatson & jmallett Sponsored by: DARPA & NAI Labs.
|
103946 |
25-Sep-2002 |
jeff |
- Convert locks to use standard macros.
- Lock access to the buflists.
- Document broken locking.
- Use vrefcnt().
|
103945 |
25-Sep-2002 |
jeff |
- Document broken locking.
- Use vrefcnt().
|
103690 |
20-Sep-2002 |
phk |
We don't need to #include <sys/disklabel.h>. We don't need to #include <sys/disklabel.h> second time either.
Sponsored by: DARPA & NAI Labs.
|
103594 |
19-Sep-2002 |
obrien |
intmax_t is printed with %jd, not %lld.
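A minimal sketch of the idiom this fix relies on: cast the value to intmax_t and use the "j" length modifier, so that the format and the argument agree on every platform. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 64-bit block number portably. "%lld" would only be right
 * where intmax_t happens to be long long; "%jd" is always right.
 */
static int
ex_format_blkno(char *buf, size_t len, int64_t blkno)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len, "%jd", (intmax_t)blkno));
}
```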
|
103314 |
14-Sep-2002 |
njl |
Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging.
Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.
Suggested by: phk Reviewed by: bde, rwatson (earlier version)
|
102991 |
05-Sep-2002 |
phk |
Implement the VOP_OPENEXTATTR() and VOP_CLOSEEXTATTR() methods.
Use extattr_check_cred() to check access to EAs.
This is still a WIP.
Sponsored by: DARPA & NAI Labs.
|
102957 |
05-Sep-2002 |
bde |
Include <sys/malloc.h> instead of depending on namespace pollution 2 layers deep in <sys/proc.h> or <sys/vnode.h>.
Include <sys/vmmeter.h> instead of depending on namespace pollution in <sys/pcpu.h>.
Sorted includes as much as possible.
|
102608 |
30-Aug-2002 |
phk |
Correctly handle setting, getting and deleting EA's with zero length content.
Sponsored by: DARPA & NAI Labs.
|
102382 |
25-Aug-2002 |
alc |
o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since pmap_zero_page() and pmap_zero_page_area() were modified to accept a struct vm_page * instead of a physical address, vm_page_zero_fill() and vm_page_zero_fill_area() have served no purpose.
|
102175 |
20-Aug-2002 |
phk |
Implement list of EA return functionality. Correctly delete EA's when the content length is set to zero.
Sponsored by: DARPA & NAI Labs.
|
102090 |
19-Aug-2002 |
phk |
First snapshot of UFS2 EA support.
Sponsored by: DARPA & NAI Labs.
|
101789 |
13-Aug-2002 |
phk |
Expand the arguments to ffs_ext{read,write}() to their component parts rather than use vop_{read,write}_args. Access to these functions will ultimately not be available through the "vop_{read,write}+IO_EXT" API but this functionality is retained for debugging purposes for now.
Sponsored by: DARPA & NAI Labs.
|
101780 |
13-Aug-2002 |
phk |
Unravel the UFS_EXTATTR incest between FFS and UFS: UFS_EXTATTR is an UFS only thing, and FFS should in principle not know if it is enabled or not.
This commit cleans ffs_vnops.c for such knowledge, but not ffs_vfsops.c
Sponsored by: DARPA and NAI Labs.
|
101777 |
13-Aug-2002 |
phk |
Introduce typedefs for the member functions of struct vfsops and employ these in the main filesystems. This does not change the resulting code but makes the source a little bit more grepable.
Sponsored by: DARPA and NAI Labs.
|
101720 |
12-Aug-2002 |
phk |
Stop pretending that the FFS file ufs_readwrite.c is a UFS file.
Instead of #including it, pull it into ffs_vnops.c and name things correctly.
Sponsored by: DARPA & NAI Labs.
|
101398 |
05-Aug-2002 |
iedowse |
Don't call softdep_slowdown() if soft updates are not active on the filesystem. This causes a panic for kernels compiled without softupdates.
Reported by: luigi
|
101308 |
04-Aug-2002 |
jeff |
- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear.
- Many functions in vfs_subr.c were restructured to provide for stronger locking.
Idea stolen from: BSD/OS
|
101018 |
31-Jul-2002 |
phk |
I forgot this bit of ugliness in the fsck_ffs cleanup.
|
100926 |
30-Jul-2002 |
phk |
Fix braino in last commit.
|
100925 |
30-Jul-2002 |
phk |
Move ffs_isfreeblock() to ffs_alloc.c and make it static.
Sponsored by: DARPA & NAI Labs.
|
100393 |
20-Jul-2002 |
benno |
Add a missing argument to the stub for softdep_setup_freeblocks.
Forgotten by: mckusick
|
100382 |
20-Jul-2002 |
peter |
Fix a warning: ffs_softdep.c:1630: warning: int format, different type arg (arg 2)
|
100344 |
19-Jul-2002 |
mckusick |
Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words).
The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed.
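The rules for VOP_READ and VOP_WRITE stated above can be summarized in a small sketch. The flag values and the function name are hypothetical; only the behavior (mutual exclusion, and IO_NORMAL assumed when neither flag is given) comes from the text.

```c
#include <assert.h>

/* Illustrative values only; the real ones live in the kernel headers. */
#define EX_IO_EXT	0x0400
#define EX_IO_NORMAL	0x0800

/*
 * Normalize the ioflags word for a read or write.
 * Returns the flags to use, or -1 if both IO_NORMAL and IO_EXT are set.
 */
static int
ex_rw_ioflags(int ioflags)
{
	int which;

	assert(ioflags >= 0);
	which = ioflags & (EX_IO_NORMAL | EX_IO_EXT);
	if (which == (EX_IO_NORMAL | EX_IO_EXT))
		return (-1);		/* mutually exclusive for read/write */
	if (which == 0)
		ioflags |= EX_IO_NORMAL;	/* backward-compatible default */
	return (ioflags);
}
```

A truncate path would differ in that it accepts both flags together, as when a file is removed.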
Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word is reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form:
if (ioflags & IO_SYNC) flags |= BA_SYNC;
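One way to picture the merged flag word is the sketch below. The specific bit values are illustrative assumptions; only the split into low sixteen bits for IO_ flags and high sixteen bits for BA_ flags is taken from the text.

```c
#include <assert.h>

/* Illustrative values only. */
#define EX_IO_MASK	0x0000ffffu	/* IO_ flags: low sixteen bits */
#define EX_BA_MASK	0xffff0000u	/* BA_ flags: high sixteen bits */
#define EX_IO_SYNC	0x00000080u
#define EX_BA_CLRBUF	0x00010000u

/* Because the two ranges are disjoint, one word can carry both kinds. */
static unsigned int
ex_merge_flags(unsigned int ioflags, unsigned int baflags)
{
	assert((ioflags & ~EX_IO_MASK) == 0);
	assert((baflags & ~EX_BA_MASK) == 0);
	return (ioflags | baflags);
}

/* The allocator can test the IO_ bit directly; no translation step. */
static int
ex_wants_sync(unsigned int flags)
{
	return ((flags & EX_IO_SYNC) != 0);
}
```

Since the caller's word is passed through unchanged, the per-flag translation shown in the commit message is no longer needed.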
For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE).
I am also contemplating adding a pathconf parameter (for concreteness, let's call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended attribute storage.
Sponsored by: DARPA & NAI Labs.
|
100201 |
16-Jul-2002 |
mckusick |
Change the name of st_createtime to st_birthtime. This change is made to reduce confusion between st_ctime and st_createtime.
Submitted by: Eric Allman <eric@sendmail.org> Sponsored by: DARPA & NAI Labs.
|
99888 |
12-Jul-2002 |
trhodes |
Fix a typo: s/your are/you are/
|
99590 |
08-Jul-2002 |
bde |
Fixed some printf format errors (4 new ones reported by gcc and 5 nearby old ones not reported by gcc). This helps unbreak LINT.
|
99220 |
01-Jul-2002 |
iedowse |
Use indirect function pointer hooks instead of #ifdef SOFTUPDATES direct calls for the two places where the kernel calls into soft updates code. Set up the hooks in softdep_initialize() and NULL them out in softdep_uninitialize(). This change allows soft updates to function correctly when ufs is loaded as a module.
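The hook pattern described above can be sketched in a few lines. All names here are hypothetical, and the "cleanup" body is a placeholder; the point is only that the generic code tests a pointer at run time instead of relying on a compile-time #ifdef.

```c
#include <assert.h>
#include <stddef.h>

/* Hook the generic code calls if a provider has registered itself. */
static int (*ex_cleanup_hook)(int) = NULL;

/* Placeholder provider routine: pretends to process n items. */
static int
ex_provider_cleanup(int n)
{
	return (n);
}

/* Called when the provider (e.g. a module) is initialized. */
static void
ex_provider_init(void)
{
	ex_cleanup_hook = ex_provider_cleanup;
}

/* Called when the provider is unloaded. */
static void
ex_provider_uninit(void)
{
	ex_cleanup_hook = NULL;
}

/* Generic caller: works whether or not the provider is present. */
static int
ex_generic_cleanup(int n)
{
	assert(n >= 0);
	if (ex_cleanup_hook == NULL)
		return (0);
	return ((*ex_cleanup_hook)(n));
}
```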
Reviewed by: mckusick
|
99206 |
01-Jul-2002 |
iedowse |
Add the ffs bits necessary to support unloading of the ufs kernel module. This adds an ffs_uninit() function that calls ufs_uninit() and also calls a new softdep_uninitialize() function. Add a stub for softdep_uninitialize() to cover the non-SOFTUPDATES case.
Reviewed by: mckusick
|
98888 |
26-Jun-2002 |
iedowse |
Remove the kernel file-size limit for UFS2, so that only the limit imposed by the filesystem structure itself remains. With 16k blocks, the maximum file size is now just over 128TB.
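The 128TB figure can be checked with a short calculation. The sketch below assumes the usual UFS layout of 12 direct block pointers and 3 levels of indirection, with 8-byte block pointers in UFS2; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EX_NDADDR	12	/* direct block pointers per inode */
#define EX_NIADDR	3	/* levels of indirect blocks */

/* Bytes addressable through the inode's block pointers. */
static uint64_t
ex_max_file_bytes(uint64_t bsize, uint64_t ptrsize)
{
	uint64_t nindir, span, blocks;
	int i;

	assert(ptrsize > 0 && bsize >= ptrsize);
	nindir = bsize / ptrsize;	/* pointers per indirect block */
	blocks = EX_NDADDR;
	span = 1;
	for (i = 0; i < EX_NIADDR; i++) {
		span *= nindir;		/* blocks reachable at this level */
		blocks += span;
	}
	return (blocks * bsize);
}
```

With 16K blocks each indirect block holds 2048 pointers, so the triple-indirect level alone covers 2048^3 blocks of 16K, which is 2^47 bytes or 128 TiB; the direct, single and double indirect levels add a little more, hence "just over".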
For now, the UFS1 file size limit is left unchanged so as to remain consistent with RELENG_4, but it too could be removed in the future.
Reviewed by: mckusick
|
98770 |
24-Jun-2002 |
jlemon |
Prototype fixes (long newinum --> ino_t newinum).
|
98687 |
23-Jun-2002 |
mux |
Warning fixes for 64 bits platforms. This eliminates all the warnings I have had in the FFS code on sparc64.
Reviewed by: mckusick
|
98658 |
23-Jun-2002 |
dillon |
Rename the BALLOC flags from B_* to BA_* to avoid confusion with the struct buf B_ flags.
Approved by: mckusick
|
98640 |
22-Jun-2002 |
mckusick |
This patch fixes a problem whereby filesystems that ran out of inodes in a cylinder group would fail to check for free inodes in other cylinder groups. This bug was introduced in the UFS2 code merge two days ago.
An inode is allocated by calling ffs_valloc which calls ffs_hashalloc to do the filesystem scan. Ffs_hashalloc walks around the cylinder groups calling its passed allocator (ffs_nodealloccg in this case) until the allocator returns a non-zero result. The bug is that ffs_hashalloc expects the passed allocator function to return a 64-bit ufs2_daddr_t. When allocating inodes, it calls ffs_nodealloccg which was returning a 32-bit ino_t. The ffs_hashalloc code checked a 64-bit return value and usually found random non-zero bits in the high 32-bits so decided that the allocation had succeeded (in this case in the only cylinder group that it checked). When the result was passed back to ffs_valloc it looked at only the bottom 32-bits, saw zero and declared the system out of inodes. But ffs_hashalloc had really only checked one cylinder group.
The fix is to change ffs_nodealloccg to return 64-bit results.
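The shape of the fixed interface can be sketched as follows. This is a simplified model with hypothetical `ex_` names: it scans the cylinder groups linearly, which may not match the real ffs_hashalloc search order, and the toy allocator stands in for ffs_nodealloccg. What it illustrates is that the callback and the caller agree on a 64-bit return type, so zero reliably means "nothing allocated here".

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ex_daddr_t;

/* Every allocator passed in must return the same 64-bit type. */
typedef ex_daddr_t ex_allocfn_t(int cg, void *arg);

/* Try each cylinder group in turn until the allocator returns non-zero. */
static ex_daddr_t
ex_hashalloc(int startcg, int ncg, ex_allocfn_t *alloc, void *arg)
{
	ex_daddr_t result;
	int i, cg;

	assert(ncg > 0 && startcg >= 0 && startcg < ncg);
	for (i = 0; i < ncg; i++) {
		cg = (startcg + i) % ncg;
		result = alloc(cg, arg);
		if (result != 0)
			return (result);
	}
	return (0);		/* every group was checked and was full */
}

/* Toy allocator: arg is an array of free-inode counts, one per group. */
static ex_daddr_t
ex_nodealloccg(int cg, void *arg)
{
	int *nfree = arg;

	if (nfree[cg] == 0)
		return (0);
	nfree[cg]--;
	return ((ex_daddr_t)cg * 1000 + 1);
}
```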
Sponsored by: DARPA & NAI Labs. Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk> Reviewed by: Maxime Henrion <mux@freebsd.org>
|
98542 |
21-Jun-2002 |
mckusick |
This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined.
Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t.
Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used).
Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
|
97962 |
06-Jun-2002 |
semenu |
Fix a typo in my recently added comment: s/beleived/believed/
Submitted by: keramida
|
97640 |
30-May-2002 |
semenu |
Remove lock from ffs_vget introduced by v1.24. Instead of locking the vnode creation globally, we allow processes to create vnodes concurrently. In case of concurrent creation of a vnode for the same ino, we allow processes to race and then check who wins.
Assuming that concurrent creation of a vnode for the same ino is a really rare case, this is believed to be an improvement, as it allows concurrent creation of vnodes.
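The "race, then check who wins" approach can be modeled in user space as below. This is a single-threaded sketch with hypothetical names; in a kernel the final recheck and insert would have to be atomic under the hash lock, which is not modeled here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_NHASH	8

struct ex_vnode {
	uint32_t ino;
	struct ex_vnode *next;
};

static struct ex_vnode *ex_hash[EX_NHASH];

static struct ex_vnode *
ex_lookup(uint32_t ino)
{
	struct ex_vnode *vp;

	for (vp = ex_hash[ino % EX_NHASH]; vp != NULL; vp = vp->next)
		if (vp->ino == ino)
			return (vp);
	return (NULL);
}

static struct ex_vnode *
ex_vget(uint32_t ino)
{
	struct ex_vnode *vp, *winner;

	if ((vp = ex_lookup(ino)) != NULL)
		return (vp);
	/* Slow part, done without a global lock; others may do the same. */
	vp = calloc(1, sizeof(*vp));
	if (vp == NULL)
		return (NULL);
	vp->ino = ino;
	/* Recheck: if someone else inserted first, discard ours. */
	winner = ex_lookup(ino);
	if (winner != NULL) {
		free(vp);
		return (winner);
	}
	vp->next = ex_hash[ino % EX_NHASH];
	ex_hash[ino % EX_NHASH] = vp;
	assert(ex_lookup(ino) == vp);
	return (vp);
}
```

The cost of losing the race is one wasted allocation, which is acceptable if collisions on the same inode are as rare as the commit assumes.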
Idea by: bp Reviewed by: dillon MFC after: 1 month
|
96873 |
18-May-2002 |
iedowse |
Remove um_i_effnlink_valid, i_spare[] and the ufsmount_u and inode_u unions, since these were only necessary when ext2fs used ufs code.
Reviewed by: mckusick
|
96821 |
17-May-2002 |
phk |
Fix ufs_daddr_t/daddr_t type problems.
Sponsored by: DARPA & NAI labs.
|
96755 |
16-May-2002 |
trhodes |
More s/file system/filesystem/g
|
96506 |
13-May-2002 |
phk |
Remove register keyword.
Sponsored by: DARPA & NAI Labs. Submitted by: mckusick
|
96473 |
12-May-2002 |
phk |
ARGH! SBLOCK is not unused. Try to get this right.
BBSIZE belongs in <sys/disklabel.h> (but shouldn't be a constant).
Define SBLOCK again, using the right math.
Sponsored by: DARPA & NAI Labs.
|
96472 |
12-May-2002 |
phk |
Remove #define for BBOFF, it is assumed == 0 so many places that we might as well forget about it. In fact the only thing which used it was the SBOFF macro.
Sponsored by: DARPA & NAI Labs.
|
96471 |
12-May-2002 |
phk |
Remove unused BBLOCK and SBLOCK #defines.
Sponsored by: DARPA & NAI Labs.
|
95974 |
03-May-2002 |
phk |
Name ufs_vop_[gs]etextattr() consistently with the rest of our VOPs and put them in the ufs_vnops where they belong, rather than in the ffs_vnops.
Ok'ed by: rwatson Sponsored by: DARPA & NAI Labs.
|
94723 |
15-Apr-2002 |
jeff |
Don't peek into the malloc_type structure for limits. The desired vnodes check should be sufficient. This is required for the pending removal of malloc_type limits.
|
94182 |
08-Apr-2002 |
phk |
Move generic disk ioctls from <sys/disklabel.h> to <sys/disk.h>.
Sponsored by: DARPA & NAI Labs
|
93818 |
04-Apr-2002 |
jhb |
Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
|
93736 |
03-Apr-2002 |
phk |
Move the FFS parameter MAXFRAG from <sys/param.h> to <ufs/ffs/fs.h>
Sponsored by: DARPA & NAI Labs.
|
93654 |
02-Apr-2002 |
phk |
Use DIOCGSECTORSIZE instead of the bogus DIOCGPART ioctl.
|
93593 |
01-Apr-2002 |
jhb |
Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
|
93430 |
30-Mar-2002 |
bde |
In ffs_mountffs(), set mnt_iosize_max to si_iosize_max unconditionally provided the latter is nonzero. At this point, the former is a fairly arbitrary default value (DFLTPHYS), so changing it to any reasonable value specified by the device driver is safe. Using the maximum of these limits broke ffs clustered i/o for devices whose si_iosize_max is < DFLTPHYS. Using the minimum would break device drivers' ability to increase the active limit from DFLTPHYS up to MAXPHYS.
Copied the code for this and the associated (unnecessary?) fixup of mp_iosize_max to all other filesystems that use clustering (ext2fs and msdosfs). It was completely missing.
PR: 36309 MFC-after: 1 week
|
92728 |
19-Mar-2002 |
alfred |
Remove __P.
|
92640 |
19-Mar-2002 |
bde |
Fixed some printf format errors (hopefully all of the remaining daddr64_t ones for GENERIC, and all others on the same line as those). Reformat the printfs if necessary to avoid new long lines or old format printf errors.
|
92462 |
17-Mar-2002 |
mckusick |
Add a flags parameter to VFS_VGET to pass through the desired locking flags when acquiring a vnode. The immediate purpose is to allow polling lock requests (LK_NOWAIT) needed by soft updates to avoid deadlock when enlisting other processes to help with the background cleanup. For the future it will allow the use of shared locks for read access to vnodes. This change touches a lot of files as it affects most filesystems within the system. It has been well tested on FFS, loopback, and CD-ROM filesystems, but only lightly on the others, so if you find a problem there, please let me (mckusick@mckusick.com) know.
|
92363 |
15-Mar-2002 |
mckusick |
Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.
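The terabyte figure follows from simple arithmetic, sketched below under two assumptions that the commit message does not spell out: that the previous block-number type was a signed 32-bit integer, and that block numbers count 512-byte (DEV_BSIZE) units. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EX_DEV_BSIZE	512	/* assumed size of one disk block */

/*
 * Bytes addressable with a signed block number of the given width.
 * Widths above 55 bits are rejected because the byte count would no
 * longer fit in a uint64_t.
 */
static uint64_t
ex_addressable_bytes(unsigned int bits)
{
	assert(bits >= 2 && bits <= 55);
	return ((UINT64_C(1) << (bits - 1)) * EX_DEV_BSIZE);
}
```

A signed 32-bit block number gives 2^31 blocks of 512 bytes, that is 2^40 bytes or 1 TiB, which is the limit the 64-bit type removes.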
|
92299 |
15-Mar-2002 |
obrien |
Quiet a warning on the Alpha.
|
92250 |
14-Mar-2002 |
mckusick |
This corrects the first of two known deadlock conditions that come from the presence of a snapshot file.
|
92095 |
11-Mar-2002 |
phk |
I missed one VOP_CLOSE in the previous commit.
Pointed out by: bde
|
92092 |
11-Mar-2002 |
phk |
As a XXX bandaid open the mounted device READ/WRITE even if we only mount read-only.
The trouble here is that we don't reopen the device in read/write mode when we remount in read/write mode resulting in a filesystem sending write requests to a device which was only opened read/only.
I'm not quite sure how such a reopen would best be done and defer the problem to more agile hackers.
|
91420 |
27-Feb-2002 |
jhb |
Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic and isn't strictly required. However, it lowers the number of false positives found when grep'ing the kernel sources for p_ucred to ensure proper locking.
|
91406 |
27-Feb-2002 |
jhb |
Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
|
90538 |
11-Feb-2002 |
julian |
In a threaded world, different priorities become properties of different entities. Make it so.
Reviewed by: jhb@freebsd.org (john baldwin)
|
90366 |
07-Feb-2002 |
mckusick |
Occasionally background fsck would cause a spurious ``freeing free inode'' panic. This change corrects that problem by setting the fs_active flag when the inode map changes to notify the snapshot code that the cylinder group must be rescanned.
Submitted by: Robert Watson <rwatson@FreeBSD.org>
|
90329 |
07-Feb-2002 |
mckusick |
Occasionally deleted files would hang around for hours or days without being reclaimed. This bug was introduced in revision 1.95 dealing with filenames placed in newly allocated directory blocks, thus is not present in 4.X systems. The bug is triggered when a new entry is made in a directory after the data block containing the original new entry has been written, but before the inode that references the data block has been written.
Submitted by: Bill Fenner <fenner@research.att.com>
|
90098 |
02-Feb-2002 |
mckusick |
When taking a snapshot, we must check for active files that have been unlinked (e.g., with a zero link count). We have to expunge all trace of these files from the snapshot so that they are neither reclaimed prematurely by fsck nor saved unnecessarily by dump.
|
89680 |
23-Jan-2002 |
mckusick |
Add a stub for softdep_request_cleanup() so that compilation without SOFTUPDATES option works properly.
Submitted by: Benno Rice <benno@jeamland.net>
|
89637 |
22-Jan-2002 |
mckusick |
This patch fixes a long standing complaint with soft updates in which small and/or nearly full filesystems would fail with `file system full' messages when trying to replace a number of existing files (for example during a system installation). When the allocation routines are about to fail with a file system full condition, they make a call to softdep_request_cleanup() which attempts to accelerate the flushing of pending deletion requests in an effort to free up space. In the face of filesystem I/O requests that exceed the available disk transfer capacity, the cleanup request could take an unbounded amount of time. Thus, the softdep_request_cleanup() routine will only try for tickdelay seconds (default 2 seconds) before giving up and returning a filesystem full error. Under typical conditions, the softdep_request_cleanup() routine is able to free up space in under fifty milliseconds.
|
89450 |
17-Jan-2002 |
mckusick |
Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26 which caused incomplete snapshots to be taken. When background fsck would run on these snapshots, the result would be files being incorrectly released which would subsequently panic the kernel with ``handle_workitem_freefile: inodedep survived'', ``handle_written_inodeblock: live inodedep'', and ``handle_workitem_remove: lost inodedep'' errors.
|
89413 |
16-Jan-2002 |
mckusick |
Put write on read-only filesystem panic after we have weeded out block and character devices, fifo's, etc.
Submitted by: Bruce Evans <bde@zeta.org.au>
|
89384 |
15-Jan-2002 |
mckusick |
When downgrading a filesystem from read-write to read-only, operations involving file removal or file update were not always being fully committed to disk. The result was lost files or corrupted file data. This change ensures that the filesystem is properly synced to disk before the filesystem is down-graded.
This delta also fixes a long standing bug in which a file open for reading has been unlinked. When the last open reference to the file is closed, the inode is reclaimed by the filesystem. Previously, if the filesystem had been down-graded to read-only, the inode could not be reclaimed, and thus was lost and had to be later recovered by fsck. With this change, such files are found at the time of the down-grade. Normally they will result in the filesystem down-grade failing with `device busy'. If a forcible down-grade is done, then the affected files will be revoked causing the inode to be released and the open file descriptors to begin failing on attempts to read.
Submitted by: "Sam Leffler" <sam@errno.com>
|
89306 |
13-Jan-2002 |
alfred |
SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.
I've polished it quite a bit reducing the need for locking and adapting it for KSE.
Locks:
1 mutex in each filedesc protects all the fields. It also protects "struct file" initialization: while a struct file is being changed from &badfileops -> &pipeops or something, the filedesc should be locked.
1 mutex in each struct file protects the refcount fields. It doesn't protect anything else. The flags used for garbage collection have been moved to f_gcflag, which was the FILLER short; this doesn't need locking because the garbage collection is a single threaded container. Could likely be made to use a pool mutex.
1 sx lock for the global filelist.
struct file * fhold(struct file *fp); /* increments reference count on a file */
struct file * fhold_locked(struct file *fp); /* like fhold but expects file to be locked */
struct file * ffind_hold(struct thread *, int fd); /* finds the struct file in thread, adds one reference and returns it unlocked */
struct file * ffind_lock(struct thread *, int fd); /* ffind_hold, but returns file locked */
I still have to smp-safe the fget cruft, I'll get to that asap.
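A userland sketch of how the fhold()/fhold_locked() pair relates, assuming a pthread mutex in place of the kernel mutex. The `xfile`, `xfhold()` and `xfhold_locked()` names are hypothetical and chosen to avoid any suggestion that this is the committed code.

```c
#include <pthread.h>

/* Hypothetical, cut-down file object: one mutex guarding the refcount. */
struct xfile {
	pthread_mutex_t	f_mtx;		/* protects f_count only */
	int		f_count;	/* reference count */
};

/* Caller already holds f_mtx: just take the reference. */
static struct xfile *
xfhold_locked(struct xfile *fp)
{
	fp->f_count++;
	return (fp);
}

/* Caller does not hold f_mtx: lock, take the reference, unlock. */
static struct xfile *
xfhold(struct xfile *fp)
{
	pthread_mutex_lock(&fp->f_mtx);
	xfhold_locked(fp);
	pthread_mutex_unlock(&fp->f_mtx);
	return (fp);
}
```

Splitting the locked and unlocked variants lets code that already holds the per-file mutex take a reference without recursing on the lock.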
|
89295 |
12-Jan-2002 |
mckusick |
When going to sleep, we must save our SPL so that it does not get lost if some other process uses the lock while we are sleeping. We restore it after we have slept. This functionality is provided by a new routine interlocked_sleep() that wraps the interlocking with functions that sleep. This function is then used in place of the old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros.
Submitted by: Debbie Chu <dchu@juniper.net>
|
89270 |
11-Jan-2002 |
mckusick |
Must call drain_output() before checking the dirty block list in softdep_sync_metadata(). Otherwise we may miss dependencies that need to be flushed which will result in a later panic with the message ``vinvalbuf: dirty bufs''.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
MFC after: 1 week
|
89089 |
08-Jan-2002 |
msmith |
Initialise the bioops vector hack at runtime rather than at link time. This avoids the use of common variables.
Reviewed by: mckusick
|
88318 |
20-Dec-2001 |
dillon |
Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout.
Hopefully MFC: before the 4.5 release
|
88138 |
18-Dec-2001 |
mckusick |
Change the atomic_set_char to atomic_set_int and atomic_clear_char to atomic_clear_int to ease the implementation for the sparc64.
Requested by: Jake Burkholder <jake@locore.ca>
|
88026 |
16-Dec-2001 |
iedowse |
Make sure we ignore the value of `fs_active' when reloading the superblock, and move the initialisation of it to beside where other pointer fields are initialised.
|
88025 |
16-Dec-2001 |
iedowse |
Move the new superblock field `fs_active' into the region of the superblock that is already set up to handle pointer types. This fixes an accidental change in the superblock size on 64-bit platforms caused by revision 1.24.
|
87827 |
14-Dec-2001 |
mckusick |
Minimize the time necessary to suspend operations on a filesystem when taking a snapshot. The two time consuming operations are scanning all the filesystem bitmaps to determine which blocks are in use and scanning all the other snapshots so as to be able to expunge their blocks from the view of the current snapshot. The bitmap scanning is broken into two passes. Before suspending the filesystem all bitmaps are scanned. After the suspension, those bitmaps that changed after being scanned the first time are rescanned. Typically there are few bitmaps that need to be rescanned. The expunging of other snapshots is now done after the suspension is released by observing that we can easily identify any blocks that were allocated to them after the suspension (they will be marked as `not needing to be copied' in the just created snapshot). For all the gory details, see the ``Running fsck in the Background'' paper in the Usenix BSDCon 2002 Conference Proceedings, pages 55-64.
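The two-pass bitmap scan can be sketched as follows. This is a simplified model, not the snapshot code: `struct cg_state`, `NCG` and the three functions are hypothetical, and "scanning" is reduced to setting flags so that only the bookkeeping is visible.

```c
#include <stdbool.h>
#include <stddef.h>

#define NCG 8	/* hypothetical number of cylinder groups */

struct cg_state {
	bool scanned;	/* bitmap has been copied into the snapshot */
	bool dirty;	/* bitmap changed since it was last scanned */
};

/* Pass 1, run before suspension: scan every cylinder group. */
static size_t
scan_all(struct cg_state cgs[], size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		cgs[i].scanned = true;
		cgs[i].dirty = false;
	}
	return (n);
}

/* Called by the (simulated) allocator whenever it touches a bitmap. */
static void
bitmap_changed(struct cg_state *cg)
{
	cg->dirty = true;
}

/* Pass 2, run while suspended: rescan only the groups that changed. */
static size_t
rescan_dirty(struct cg_state cgs[], size_t n)
{
	size_t i, rescanned = 0;

	for (i = 0; i < n; i++) {
		if (cgs[i].dirty) {
			cgs[i].dirty = false;
			rescanned++;
		}
	}
	return (rescanned);
}
```

The work done while the filesystem is suspended is proportional to the number of bitmaps modified between the two passes rather than to the total number of cylinder groups.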
|
87782 |
13-Dec-2001 |
mckusick |
When a file is partially truncated, we first check to see if the new file end will land in the middle of a file hole. Since the last block of a file must always be allocated, the hole is filled by allocating a block at that location. If the hole being filled is a direct block, then the truncation may eventually reduce the full sized block down to a fragment. When running with soft updates, it is necessary to FSYNC the file after allocating the block and before creating the fragment to avoid triggering a soft updates inconsistency when the block unexpectedly shrinks.
Found by: Matthew Dillon <dillon@apollo.backplane.com>
MFC after: 1 week
|
85517 |
26-Oct-2001 |
dillon |
Implement kern.maxvnodes. Adjusting kern.maxvnodes now actually has a real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
|
85339 |
23-Oct-2001 |
dillon |
Change the vnode list under the mount point from a LIST to a TAILQ in preparation for an implementation of limiting code for kern.maxvnodes.
MFC after: 3 days
|
84374 |
02-Oct-2001 |
rwatson |
o Replace two direct uid!=0 comparisons with suser_xxx() calls.
Obtained from: TrustedBSD Project
|
84373 |
02-Oct-2001 |
rwatson |
o Replace two direct uid!=0 comparisons with suser_td() calls.
Obtained from: TrustedBSD Project
|
84050 |
27-Sep-2001 |
jhb |
- Fix some minor whitespace nits.
- Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a little nit that caused it to be defined as -(sizeof (struct thread) + 1) instead of -2.
|
83366 |
12-Sep-2001 |
julian |
KSE Milestone 2
Note: ALL MODULES MUST BE RECOMPILED
Make the kernel aware that there are smaller units of scheduling than the process (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process.
Sorry john! (your next MFC will be a doozy!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
|
83263 |
09-Sep-2001 |
iedowse |
The "dirpref" directory layout preference improvements make use of an array "fs_contigdirs[]" to avoid too many directories getting created in each cylinder group. The memory required for this and two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with a single malloc() call, and divided up afterwards. However, the 'space' pointer is not advanced correctly, so fs_contigdirs and fs_maxcluster end up pointing to the same address.
Add the missing code to advance the 'space' pointer, and remove an unnecessary update of the pointer that follows.
This is likely to fix the "ffs_clusteralloc: map mismatch" panics that have been reported recently.
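The carve-up of one allocation into several arrays, and the pointer advance that was missing, can be illustrated with a small sketch. The struct, field types and `carve_arrays()` are hypothetical simplifications (the real fs_csp[] entries are not plain int32_t); only the pattern of advancing `space' after each assignment reflects the text above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified view of the three in-core arrays. */
struct incore {
	int32_t	*csp;		/* per-group summary (simplified) */
	int32_t	*maxcluster;	/* per-group maximum cluster size */
	uint8_t	*contigdirs;	/* per-group directory counters */
	void	*base;		/* start of the single allocation */
};

/* Allocate once, then hand out consecutive slices of the buffer. */
static int
carve_arrays(struct incore *ic, size_t ncg)
{
	uint8_t *space;
	size_t size;

	size = ncg * sizeof(int32_t)	/* csp */
	    + ncg * sizeof(int32_t)	/* maxcluster */
	    + ncg * sizeof(uint8_t);	/* contigdirs */
	space = calloc(1, size);
	if (space == NULL)
		return (-1);
	ic->base = space;

	ic->csp = (int32_t *)space;
	space += ncg * sizeof(int32_t);
	ic->maxcluster = (int32_t *)space;
	/* Without this advance the next array would alias maxcluster. */
	space += ncg * sizeof(int32_t);
	ic->contigdirs = space;
	return (0);
}
```

If the second advance is omitted, writes through one array silently corrupt the other, which is consistent with the map-mismatch symptom mentioned above.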
Submitted by: Luke Mewburn <lukem@wasabisystems.com>
|
82755 |
01-Sep-2001 |
rwatson |
o At some point, unmounting a non-EA file system with EA's compiled in got a bit broken, when ufs_extattr_stop() was called and failed, ufs_extattr_destroy() would panic. This makes the call to destroy() conditional on the success of stop().
Submitted by: Christian Carstensen <cc@devcon.net>
Obtained from: TrustedBSD Project
|
79769 |
16-Jul-2001 |
peter |
Use a fixed type for times in on-disk structures for ufs rather than something that could potentially change like time_t.
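A minimal sketch of the idea, assuming a hypothetical on-disk record: the stored field has a fixed width regardless of how wide time_t is on the host, and conversion happens explicitly at the boundary. The struct and function names are invented for illustration.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical on-disk record; field names are illustrative only. */
struct ondisk_stamp {
	int32_t	os_time;	/* seconds since the epoch, always 32 bits */
	int32_t	os_flags;
};

/* Convert explicitly when moving from the in-core to the on-disk form. */
static void
stamp_store(struct ondisk_stamp *os, time_t now)
{
	os->os_time = (int32_t)now;
}

/* Widen back to the host's time_t when reading the record. */
static time_t
stamp_load(const struct ondisk_stamp *os)
{
	return ((time_t)os->os_time);
}
```

Because the layout no longer depends on time_t, the record keeps the same size and field offsets on every platform.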
|
78940 |
28-Jun-2001 |
jhb |
Fix more mntvnode and vnode interlock order reversals.
|
78912 |
28-Jun-2001 |
jhb |
- Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.
|
78256 |
15-Jun-2001 |
peter |
Fix warning: 1973: warning: int format, long int arg (arg 5)
|
78191 |
13-Jun-2001 |
mckusick |
Build on the change in revision 1.98 by Tor.Egge@fast.no. The symptom being treated in 1.98 was to avoid freeing a pagedep dependency if there was still a newdirblk dependency referencing it. That change is correct and no longer prints a warning message when it occurs. The other part of revision 1.98 was to panic when a newdirblk dependency was encountered during a file truncation. This fix removes that panic and replaces it with code to find and delete the newdirblk dependency so that the truncation can succeed.
|
77743 |
05-Jun-2001 |
obrien |
There seems to be a problem in which disk write operations are ordered incorrectly due to a missing check for some dependency. This change avoids the freelist corruption (but not the temporarily inconsistent state of the file system).
A message is printed as a reminder of the underlying problem when a pagedep structure is not freed due to the NEWBLOCK flag being set.
Submitted by: Tor.Egge@fast.no
|
77509 |
30-May-2001 |
jhb |
Revert the previous commit in favor of the fix in rev 1.42 of ufs/ffs/ffs_extern.h instead.
Requested by: bde
|
77508 |
30-May-2001 |
jhb |
Forward declare struct cg to quiet a warning.
Submitted by: bde
|
77445 |
29-May-2001 |
jhb |
Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a warning.
|
77437 |
29-May-2001 |
phk |
Remove last vestiges of MFS.
|
76900 |
20-May-2001 |
mckusick |
Update softdep_setup_directory_add prototype to reflect changes in actual function.
Obtained from: Jim Bloom <bloom@jbloom.jbloom.org>
|
76859 |
19-May-2001 |
mckusick |
Must ensure that all the entries on the pd_pendinghd list have been committed to disk before clearing them. More specifically, when free_newdirblk is called, we know that the inode claims the new directory block. However, if the associated pagedep is still linked onto the directory buffer dependency chain, then some of the entries on the pd_pendinghd list may not be committed to disk yet. In this case, we will simply note that the inode claims the block and let the pd_pendinghd list be processed when the pagedep is next written. If the pagedep is no longer on the buffer dependency chain, then all the entries on the pd_pendinghd list are committed to disk and we can free them in free_newdirblk. This corrects a window of vulnerability introduced in the code added in version 1.95.
|
76825 |
18-May-2001 |
mckusick |
Must be a bit less aggressive about freeing pagedep structures.
Obtained from: Robert Watson <rwatson@FreeBSD.org> and Matthew Jacob <mjacob@feral.com>
|
76724 |
17-May-2001 |
mckusick |
When a new block is allocated to a directory, an fsync of a file whose name is within that block must ensure not only that the block containing the file name has been written, but also that the on-disk directory inode references that block. When a new directory block is created, we allocate a newdirblk structure which is linked to the associated allocdirect (on its ad_newdirblk list). When the allocdirect has been satisfied, the newdirblk structure is moved to the inodedep id_bufwait list of its directory to await the inode being written. When the inode is written, the directory entries are fully committed and can be deleted from their pagedep->pd_pendinghd and inodedep->id_pendinghd lists.
|
76688 |
16-May-2001 |
iedowse |
Change the second argument of vflush() to an integer that specifies the number of references on the filesystem root vnode to be both expected and released. Many filesystems hold an extra reference on the filesystem root vnode, which must be accounted for when determining if the filesystem is busy and then released if it isn't busy. The old `skipvp' approach required individual filesystem xxx_unmount functions to re-implement much of vflush()'s logic to deal with the root vnode.
All 9 filesystems that hold an extra reference on the root vnode got the logic wrong in the case of forced unmounts, so `umount -f' would always fail if there were any extra root vnode references. Fix this issue centrally in vflush(), now that we can.
This commit also fixes a vnode reference leak in devfs, which could result in idle devfs filesystems that refuse to unmount.
Reviewed by: phk, bp
|
76580 |
14-May-2001 |
mckusick |
Further fixes for deadlock in the presence of multiple snapshots. There are still more to find, but this fix should cover the common cases that folks are hitting.
|
76458 |
11-May-2001 |
mckusick |
Remove yet another deadlock case.
|
76357 |
08-May-2001 |
mckusick |
When running with soft updates, track the number of blocks and files that are committed to being freed and reflect these blocks in the counts returned by statfs (and thus also by the `df' command). This change allows programs such as those that do news expiration to know when to stop if they are trying to create a certain percentage of free space. Note that this change does not solve the much harder problem of making this to-be-freed space available to applications that want it (thus on a nearly full filesystem, you may still encounter out-of-space conditions even though the free space will show up eventually). Hopefully this harder problem will be the subject of a future enhancement.
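The accounting distinction can be sketched as follows, assuming a hypothetical `struct fs_counts` with one counter for blocks free now and one for blocks committed to being freed. The names are illustrative and not the superblock's field names.

```c
#include <stdint.h>

/* Hypothetical counters; not the real superblock fields. */
struct fs_counts {
	int64_t	nbfree;		/* blocks free on disk right now */
	int64_t	pendingblocks;	/* blocks committed to being freed */
};

/* What a statfs-style report shows: pending frees count as free. */
static int64_t
reported_bfree(const struct fs_counts *fc)
{
	return (fc->nbfree + fc->pendingblocks);
}

/* What an allocation can actually use until the frees complete. */
static int64_t
allocatable_bfree(const struct fs_counts *fc)
{
	return (fc->nbfree);
}
```

The gap between the two values is exactly the case noted above: the report can show free space while an allocation on a nearly full filesystem still fails.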
|
76356 |
08-May-2001 |
mckusick |
Several fixes for units errors:
1) Do not assume that the superblock will be of size fs->fs_bsize. This fixes a panic when taking a snapshot on a filesystem with a block size bigger than 8K.
2) Properly calculate the number of fragments that follow the superblock summary information. This fixes a bug with inconsistent snapshots.
3) When cleaning up a snapshot that is about to be removed, properly calculate the number of blocks that need to be checked. This fixes a bug that created partially allocated inodes.
4) When moving blocks from a snapshot that is about to be removed to another snapshot, properly account for the reduced number of blocks in the snapshot from which they are taken. This fixes a bug in which the number of blocks released from a snapshot did not match the number that it claimed to have.
|
76354 |
08-May-2001 |
mckusick |
When syncing out snapshot metadata, we must temporarily allow recursive buffer locking so as to avoid locking against ourselves if we need to write filesystem metadata.
|
76269 |
04-May-2001 |
mckusick |
Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce the amount of time that the filesystem must be suspended. The current snapshot is elided as well as the earlier snapshots.
|
76173 |
01-May-2001 |
phk |
Remove blatantly pointless call to VOP_BMAP().
Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.
|
76167 |
01-May-2001 |
phk |
Implement vop_std{get|put}pages() and add them to the default vop[].
Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but the default.
|
76132 |
29-Apr-2001 |
phk |
VOP_BALLOC was never really a VOP in the first place, so convert it to UFS_BALLOC like the other "between UFS and FFS function interfaces".
|
76126 |
29-Apr-2001 |
phk |
Remove faint traces of non-existent ffs_bmap().
|
76117 |
29-Apr-2001 |
grog |
Revert consequences of changes to mount.h, part 2.
Requested by: bde
|
75993 |
26-Apr-2001 |
mckusick |
Rather than copying all the indirect blocks of the snapshot, simply mark them as BLK_NOCOPY. This trick cuts the initial size of the snapshot in half and cuts the time to take a snapshot by a third.
|
75943 |
25-Apr-2001 |
mckusick |
When closing the last reference to an unlinked file, it is freed by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended.
For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written.
Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.
|
75934 |
25-Apr-2001 |
phk |
Move the netexport structure from the fs-specific mount structure to struct mount.
This makes the "struct netexport *" parameter to the vfs_export and vfs_checkexport interface unneeded.
Consequently, all non-stacking filesystems can use vfs_stdcheckexp().
At the same time, make it a pointer to a struct netexport in struct mount, so that we can remove the bogus AF_MAX and #include <net/radix.h> from <sys/mount.h>.
|
75892 |
24-Apr-2001 |
iedowse |
Pre-dirpref versions of fsck may zero out the new superblock fields fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause panics if these fields were zeroed while a filesystem was mounted read-only, and then remounted read-write.
Add code to ffs_reload() which copies the fs_contigdirs pointer from the previous superblock, and reinitialises fs_avgf* if necessary.
Reviewed by: mckusick
|
75858 |
23-Apr-2001 |
grog |
Correct #includes to work with fixed sys/mount.h.
|
75573 |
17-Apr-2001 |
mckusick |
Add debugging option to always read/write cylinder groups as full sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'. Add debugging option to disable background writes of cylinder groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'. These debugging options should be tried on systems that are panicking with corrupted cylinder group maps to see if it makes the problem go away. The set of panics in question are:
    ffs_clusteralloc: map mismatch
    ffs_nodealloccg: map corrupted
    ffs_nodealloccg: block not in map
    ffs_alloccg: map corrupted
    ffs_alloccg: block not in map
    ffs_alloccgblk: cyl groups corrupted
    ffs_alloccgblk: can't find blk in cyl
    ffs_checkblk: partially free fragment
The following panics are less likely to be related to this problem, but might be helped by these debugging options:
    ffs_valloc: dup alloc
    ffs_blkfree: freeing free block
    ffs_blkfree: freeing free frag
    ffs_vfree: freeing free inode
If you try these options, please report whether they helped reduce your bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com> and to Matt Dillon <dillon@earth.backplane.com>.
|
75572 |
17-Apr-2001 |
mckusick |
Background fsck sysctl operations must use vn_start_write and vn_finished_write so that they do not attempt to modify a suspended filesystem.
|
75515 |
14-Apr-2001 |
mckusick |
Update to describe use of mdconfig instead of deprecated vnconfig.
Submitted by: Steve Ames <steve@virtual-voodoo.com>
|
75503 |
14-Apr-2001 |
mckusick |
This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag. It is described in ufs/ffs/fs.h as follows:
/*
 * Filesystem flags.
 *
 * Note that the FS_NEEDSFSCK flag is set and cleared only by the
 * fsck utility. It is set when background fsck finds an unexpected
 * inconsistency which requires a traditional foreground fsck to be
 * run. Such inconsistencies should only be found after an uncorrectable
 * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
 * it has successfully cleaned up the filesystem. The kernel uses this
 * flag to enforce that inconsistent filesystems be mounted read-only.
 */
#define FS_UNCLEAN    0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP  0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK  0x04 /* filesystem needs sync fsck before mount */
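A sketch of the read-only enforcement rule stated in the comment above. Only the three flag values come from this checkin; `mount_allowed()` is a hypothetical helper that models the FS_NEEDSFSCK rule alone and ignores everything else the real mount path considers (for example the clean flag or forced mounts).

```c
#include <stdbool.h>
#include <stdint.h>

#define FS_UNCLEAN	0x01	/* filesystem not clean at mount */
#define FS_DOSOFTDEP	0x02	/* filesystem using soft dependencies */
#define FS_NEEDSFSCK	0x04	/* filesystem needs sync fsck before mount */

/*
 * Hypothetical policy check: a read-only mount is always allowed here,
 * while a read-write mount is refused when FS_NEEDSFSCK is set.
 */
static bool
mount_allowed(int32_t fs_flags, bool rdonly)
{
	if (rdonly)
		return (true);
	return ((fs_flags & FS_NEEDSFSCK) == 0);
}
```

Testing the single bit with a mask leaves the other flags free to be set independently, so a soft updates filesystem that needs a foreground fsck is still recognised.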
|
75377 |
10-Apr-2001 |
mckusick |
Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>. His description of the problem and solution follow. My own tests show speedups on typical filesystem intensive workloads of 5% to 12% which is very impressive considering the small amount of code change involved.
------
One day I noticed that some file operations run much faster on small file systems than on big ones. I've looked at the ffs algorithms, thought about them, and redesigned the dirpref algorithm.
First I want to describe the results of my tests. These results are old and I have improved the algorithm after these tests were done. Nevertheless they show how big the performance speedup may be. I have done two file/directory intensive tests on two OpenBSD systems with the old and new dirpref algorithms. The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports". The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release. It contains 6596 directories and 13868 files. The test systems are:
1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for test is at wd1.
   Size of test file system is 8 Gb, number of cg=991, size of cg is 8m, block size = 8k, fragment size = 1k.
   OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=35
2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system at wd0, file system for test is at wd1.
   Size of test file system is 40 Gb, number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k.
   OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50
You can get more info about the test systems and methods at: http://www.ptci.ru/gluk/dirpref/old/dirpref.html
Test Results
                  tar -xzf ports.tar.gz                  rm -rf ports
mode       old dirpref  new dirpref  speedup   old dirpref  new dirpref  speedup
First system
normal         667          472        1.41        477          331        1.44
async          285          144        1.98        130           14        9.29
sync           768          616        1.25        477          334        1.43
softdep        413          252        1.64        241           38        6.34
Second system
normal         329           81        4.06        263.5         93.5      2.81
async          302           25.7     11.75        112            2.26    49.56
sync           281           57.0      4.93        263           90.5      2.9
softdep        341           40.6      8.4         284            4.76    59.66
The "old dirpref" and "new dirpref" columns give the test time in seconds. The speedup column is the ratio of the two times, i.e. old dirpref / new dirpref.
------
Algorithm description
The old dirpref algorithm is described in comments:
/*
 * Find a cylinder to place a directory.
 *
 * The policy implemented by this algorithm is to select from
 * among those cylinder groups with above the average number of
 * free inodes, the one with the smallest number of directories.
 */
A new directory is allocated in a different cylinder group than its parent directory, resulting in a directory tree that is spread across all the cylinder groups. This spreading out results in non-optimal access to the directories and files. When we have a small filesystem it is not a problem, but when the filesystem is big the performance degradation becomes very apparent.
What do I mean by a big file system?
1. A big filesystem is a filesystem which occupies 20-30 or more percent of total drive space, i.e. first and last cylinder are physically located relatively far from each other.
2. It has a relatively large number of cylinder groups, for example more cylinder groups than 50% of the buffers in the buffer cache.
The first results in long access times, while the second results in many buffers being used by metadata operations. Such operations use cylinder group blocks and on-disk inode blocks. The cylinder group block (fs->fs_cblkno) contains struct cg, inode and block bit maps. It is 2k in size for the default filesystem parameters. If new and parent directories are located in different cylinder groups then the system performs more input/output operations and uses more buffers. On filesystems with many cylinder groups, lots of cache buffers are used for metadata operations.
My solution for this problem is very simple. I allocate many directories in one cylinder group. I also do some things, so that the new allocation method does not cause excessive fragmentation and all directory inodes will not be located at a location far from its file's inodes and data. The algorithm is:
/*
 * Find a cylinder group to place a directory.
 *
 * The policy implemented by this algorithm is to allocate a
 * directory inode in the same cylinder group as its parent
 * directory, but also to reserve space for its files inodes
 * and data. Restrict the number of directories which may be
 * allocated one after another in the same cylinder group
 * without intervening allocation of files.
 *
 * If we allocate a first level directory then force allocation
 * in another cylinder group.
 */
My early versions of dirpref gave good results for a wide range of file operations and different filesystem capacities except in one case: those applications that create their entire directory structure first and only later fill this structure with files.
My solution for such and similar cases is to limit a number of directories which may be created one after another in the same cylinder group without intervening file creations. For this purpose, I allocate an array of counters at mount time. This array is linked to the superblock fs->fs_contigdirs[cg]. Each time a directory is created the counter increases and each time a file is created the counter decreases. A 60Gb filesystem with 8mb/cg requires 10kb of memory for the counters array.
The maxcontigdirs is a maximum number of directories which may be created without an intervening file creation. I found in my tests that the best performance occurs when I restrict the number of directories in one cylinder group such that all its files may be located in the same cylinder group. There may be some deterioration in performance if all the file inodes are in the same cylinder group as its containing directory, but their data partially resides in a different cylinder group. The maxcontigdirs value is calculated to try to prevent this condition. Since there is no way to know how many files and directories will be allocated later I added two optimization parameters in superblock/tunefs. They are:
int32_t fs_avgfilesize; /* expected average file size */
int32_t fs_avgfpdir;    /* expected # of files per directory */
These parameters have reasonable defaults but may be tweaked for special uses of a filesystem. They are only necessary in rare cases like better tuning a filesystem being used to store a squid cache.
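The per-cylinder-group counter scheme described above can be sketched as follows. This is a deliberately reduced model: `struct dirpref_state`, `NCG` and the three functions are hypothetical, the limit is supplied by the caller rather than derived from fs_avgfilesize and fs_avgfpdir, and the real algorithm also weighs free inodes and free blocks, which are ignored here. The fallback of returning the parent's group when every group is at the limit is an assumption of this sketch.

```c
#include <stdint.h>

#define NCG 4	/* hypothetical number of cylinder groups */

struct dirpref_state {
	uint8_t	contigdirs[NCG];	/* directories created in a row, per group */
	uint8_t	maxcontigdirs;		/* limit before moving to another group */
};

/* A directory was created in `cg': bump the counter, saturating at 255. */
static void
note_dir_created(struct dirpref_state *s, int cg)
{
	if (s->contigdirs[cg] < 255)
		s->contigdirs[cg]++;
}

/* A file was created in `cg': decrease the counter, never below zero. */
static void
note_file_created(struct dirpref_state *s, int cg)
{
	if (s->contigdirs[cg] > 0)
		s->contigdirs[cg]--;
}

/*
 * Prefer the parent's group; if it has reached the limit, try the
 * following groups in order.
 */
static int
pick_cg(const struct dirpref_state *s, int parent_cg)
{
	int i, cg;

	for (i = 0; i < NCG; i++) {
		cg = (parent_cg + i) % NCG;
		if (s->contigdirs[cg] < s->maxcontigdirs)
			return (cg);
	}
	return (parent_cg);
}
```

An application that creates its whole directory tree before any files drives the counter up to the limit and is pushed to the next group, while interleaved file creation keeps the counter low and keeps directories near their parent.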
I have been using this algorithm for about 3 months. I have done a lot of testing on filesystems with different capacities, average filesize, average number of files per directory, and so on. I think this algorithm has no negative impact on filesystem performance. It works better than the default one in all cases. The new dirpref will greatly improve untarring/removing/copying of big directories, decrease load on cvs servers and much more. The new dirpref doesn't speed up a compilation process, but also doesn't slow it down.
Obtained from: Grigoriy Orlov <gluk@ptci.ru>
|
74747 |
24-Mar-2001 |
asmodai |
Fix typo ); -> ,
|
74705 |
23-Mar-2001 |
mckusick |
Check that background fsck operation is being done on a ufs filesystem.
Obtained from: Robert Watson <rwatson@FreeBSD.org>
|
74548 |
21-Mar-2001 |
mckusick |
Add kernel support for running fsck on active filesystems.
|
74547 |
21-Mar-2001 |
mckusick |
Clear the fs_clean flag only when the FS_UNCLEAN flag is not set (as is done in unmount).
Remove a snapshot inode from the superblock list when its last name goes away rather than when its last reference goes away. That way it will be properly reclaimed by fsck after a crash rather than reenabled when the filesystem is mounted.
|
74545 |
21-Mar-2001 |
mckusick |
Report the correct inode number when panicking with freeing free inode. Report the correct block number when panicking with freeing free block.
|
74433 |
19-Mar-2001 |
rwatson |
o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively. This change reflects the fact that our EA support is implemented entirely at the UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount events). This also better reflects the fact that [shortly] MFS will also support EAs, as well as possibly IFS.
o Consumers of the EA support in FFS are reminded that as a result, they must change kernel config files to reflect the new option names.
Obtained from: TrustedBSD Project
|
74234 |
14-Mar-2001 |
rwatson |
o Implement "options FFS_EXTATTR_AUTOSTART", which depends on "options FFS_EXTATTR". When extended attribute auto-starting is enabled, FFS will scan the .attribute directory off of the root of each file system, as it is mounted. If .attribute exists, EA support will be started for the file system. If there are files in the directory, FFS will attempt to start them as attribute backing files for attributes bearing the same name. All attributes are started before access to the file system is permitted, so this permits race-free enabling of attributes. For attributes backing support for security features, such as ACLs, MAC, Capabilities, this is vital, as it prevents the file system attributes from getting out of sync as a result of file system operations between mount-time and the enabling of the extended attribute. The userland extattrctl tool will still function exactly as previously. Files must be placed directly in .attribute, which must be directly off of the file system root: symbolic links are not permitted. FFS_EXTATTR will continue to be able to function without FFS_EXTATTR_AUTOSTART for sites that do not want/require auto-starting. If you're using the UFS_ACL code available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART is recommended.
o This support is implemented by adding an invocation of ufs_extattr_autostart() to ffs_mountfs(). In addition, several new supporting calls are introduced in ufs_extattr.c:
    ufs_extattr_autostart(): start EAs on the specified mount.
    ufs_extattr_lookup(): given a directory and filename, return the vnode for the file.
    ufs_extattr_enable_with_open(): invoke ufs_extattr_enable() after doing the equivalent of vn_open() on the passed file.
    ufs_extattr_iterate_directory(): iterate over a directory, invoking ufs_extattr_lookup() and ufs_extattr_enable_with_open() on each entry.
o This feature is not widely tested, and therefore may contain bugs, caution is advised. Several changes are in the pipeline for this feature, including breaking out of EA namespaces into subdirectories of .attribute (this is waiting on the updated EA API), as well as a per-filesystem flag indicating whether or not EAs should be auto-started. This is required because administrators may not want .attribute auto-started on all file systems, especially if non-administrators have write access to the root of a file system.
Obtained from: TrustedBSD Project
|
73942 |
07-Mar-2001 |
mckusick |
Fixes to track snapshot copy-on-write checking in the specinfo structure rather than assuming that the device vnode would reside in the FFS filesystem (which is obviously a broken assumption with the device filesystem).
|
73287 |
01-Mar-2001 |
mckusick |
Free lock before returning from process_worklist_item.
Obtained from: Constantine Sapuntzakis <csapuntz@stanford.edu>
|
73286 |
01-Mar-2001 |
adrian |
Reviewed by: jlemon
An initial tidyup of the mount() syscall and VFS mount code.
This code replaces the earlier work done by jlemon in an attempt to make linux_mount() work.
* the guts of the mount work has been moved into vfs_mount().
* move `type', `path' and `flags' from being userland variables into being kernel variables in vfs_mount(). `data' remains a pointer into userspace.
* Attempt to verify the `type' and `path' strings passed to vfs_mount() aren't too long.
* rework mount() and linux_mount() to take the userland parameters (besides data, as mentioned) and pass kernel variables to vfs_mount(). (linux_mount() already did this, I've just tidied it up a little more.)
* remove the copyin*() stuff for `path'. `data' still requires copyin*() since it's a pointer into userland.
* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each filesystem. This variable is generally initialised with `path', and each filesystem can override it if they want to.
* NOTE: f_mntonname is initialised with "/" in the case of a root mount.
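The length check on `type' and `path' described in the list above can be pictured with a small userland sketch. This is not the kernel code: `toy_copy_bounded` is a hypothetical helper standing in for a bounded string copy, and returning ENAMETOOLONG on overflow is an assumption made for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy a NUL-terminated string into a fixed-size
 * buffer. Returns 0 on success, or ENAMETOOLONG if the string (including
 * its terminator) does not fit. The destination is always terminated.
 */
static int
toy_copy_bounded(const char *src, char *dst, size_t dstlen)
{
	size_t i;

	assert(src != NULL && dst != NULL && dstlen > 0);
	for (i = 0; i < dstlen; i++) {
		dst[i] = src[i];
		if (src[i] == '\0')
			return (0);	/* whole string fit */
	}
	dst[dstlen - 1] = '\0';		/* truncate, then report the error */
	return (ENAMETOOLONG);
}
```

A caller would reject the mount request as soon as either string fails the check, before any filesystem-specific code sees it.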
|
72941 |
23-Feb-2001 |
mckusick |
Free lock before calling panic so that subsequent attempt to write out buffers does not re-panic with `locking against myself'. This change should not affect normal operations of soft updates in any way.
|
72872 |
22-Feb-2001 |
mckusick |
When cleaning up excess inode dependencies, check for being done.
Reviewed by: Jan Koum <jkb@yahoo-inc.com>
|
72765 |
20-Feb-2001 |
mckusick |
This patch corrects two problems with the rate limiting code that was introduced in revision 1.80. The problem manifested itself with a `locking against myself' panic and could also result in soft updates inconsistencies associated with inodedeps. The two problems are:
1) One of the background operations could manipulate the bitmap while holding it locked with intent to create. This held lock results in a `locking against myself' panic, when the background processing that we have been coopted to do tries to lock the bitmap which we are already holding locked. To understand how to fix this problem, first, observe that we can do the background cleanups in inodedep_lookup only when allocating inodedeps (DEPALLOC is set in the call to inodedep_lookup). Second observe that calls to inodedep_lookup with DEPALLOC set can only happen from the following calls into the softdep code:
softdep_setup_inomapdep
softdep_setup_allocdirect
softdep_setup_remove
softdep_setup_freeblocks
softdep_setup_directory_change
softdep_setup_directory_add
softdep_change_linkcnt
Only the first two of these can come from ffs_alloc.c while holding a bitmap locked. Thus, inodedep_lookup must not go off to do request_cleanups when being called from these functions. This change adds a flag, NODELAY, that can be passed to inodedep_lookup to let it know that it should not do background processing in those cases.
2) The return value from request_cleanup when helping out with the cleanup was 0 instead of 1. This meant that despite the fact that we may have slept while doing the cleanups, the code did not recheck for the appearance of an inodedep (e.g., goto top in inodedep_lookup). This led to the softdep inconsistency in which we ended up with two inodedep's for the same inode.
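The two fixes can be pictured with a toy model. This is a sketch only: every `toy_` name is hypothetical, the table is a flat array rather than the real hash, and "sleeping" is simulated by letting the cleanup routine insert an entry as if another process had done so in the meantime.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DEPALLOC	0x1	/* allocate an entry if none exists */
#define TOY_NODELAY	0x2	/* caller holds a bitmap lock: no background work */
#define TOY_MAXDEPS	16

struct toy_table {
	int	inos[TOY_MAXDEPS];	/* inode numbers that have an entry */
	size_t	count;
	int	racing_ino;	/* entry that appears during cleanup; 0 = none */
	int	cleanups_run;
};

static int
toy_find(const struct toy_table *t, int ino)
{
	size_t i;

	for (i = 0; i < t->count; i++)
		if (t->inos[i] == ino)
			return (1);
	return (0);
}

/* Stand-in for request_cleanup: returns 1 because the caller may have slept. */
static int
toy_request_cleanup(struct toy_table *t)
{
	t->cleanups_run++;
	if (t->racing_ino != 0) {
		assert(t->count < TOY_MAXDEPS);
		t->inos[t->count++] = t->racing_ino;
		t->racing_ino = 0;
	}
	return (1);
}

/* Returns 1 if an entry for ino exists on return, 0 otherwise. */
static int
toy_lookup(struct toy_table *t, int ino, int flags)
{
	int helped = 0;

top:
	if (toy_find(t, ino))
		return (1);
	if ((flags & TOY_DEPALLOC) == 0)
		return (0);
	/* Fix 1: skip background work when the caller asked for NODELAY. */
	if ((flags & TOY_NODELAY) == 0 && !helped) {
		helped = 1;
		/* Fix 2: a nonzero return means "recheck from the top". */
		if (toy_request_cleanup(t))
			goto top;
	}
	assert(t->count < TOY_MAXDEPS);
	t->inos[t->count++] = ino;
	return (1);
}
```

Without the `goto top`, the lookup would append a second entry for an inode that gained one while the caller was helping with cleanup, which is the duplicate described in item 2.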
Reviewed by: Peter Wemm <peter@yahoo-inc.com>, Matt Dillon <dillon@earth.backplane.com>
|
72645 |
18-Feb-2001 |
asmodai |
Preceed/preceeding are not English words. Use precede and preceding.
|
72376 |
12-Feb-2001 |
jake |
Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different scheduling classes using different portions of the array. This allows user processes to have their priorities propagated up into interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than 32. We used to have 4 separate arrays of 32 queues each, so this may not be optimal. The new run queue code was written with this in mind; changing the number of run queues only requires changing constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This is intended to be used to create per-cpu run queues. Implement wrappers for compatibility with the old interface which pass in the global run queue structure.
- Group the priority level, user priority, native priority (before propagation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc. This was used to detect when a process' priority had lowered and it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired effect without needing to also propagate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the idle loop. It interfered with propogate_priority() because the idle process needed to do a non-blocking acquire of Giant and then other processes would try to propagate their priority onto it. The idle process should not do anything except idle. vm_page_zero_idle() will return in the form of an idle priority kernel thread which is woken up at appropriate times by the vm system.
- Update struct kinfo_proc to the new priority interface. Deliberately change its size by adjusting the spare fields. It remained the same size, but the layout has changed, so userland processes that use it would parse the data incorrectly. The size constraint should really be changed to an arbitrary version number. Also add a debug.sizeof sysctl node for struct kinfo_proc.
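As a rough illustration of a single array of run queues shared by all scheduling classes, the sketch below maps a priority level to one of 64 queues and finds the best non-empty queue with a bitmap. The count of 256 priority levels, the rule that lower numbers run first, and all `toy_` names are assumptions made for this example; they are not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NQS		64			/* run queues, as in the log message */
#define TOY_NPRI	256			/* assumed number of priority levels */
#define TOY_PPQ		(TOY_NPRI / TOY_NQS)	/* priority levels per queue */

struct toy_runq {
	uint64_t nonempty;		/* bit q set => queue q has an entry */
	int	 length[TOY_NQS];	/* entry counts (stand-in for real lists) */
};

/* Map a priority level to the index of the queue that holds it. */
static int
toy_queue_index(int pri)
{
	assert(pri >= 0 && pri < TOY_NPRI);
	return (pri / TOY_PPQ);
}

/* Record one runnable entry at the given priority. */
static void
toy_runq_add(struct toy_runq *rq, int pri)
{
	int q = toy_queue_index(pri);

	rq->length[q]++;
	rq->nonempty |= (uint64_t)1 << q;
}

/* Return the lowest-numbered non-empty queue, or -1 if all are empty. */
static int
toy_runq_best(const struct toy_runq *rq)
{
	int q;

	for (q = 0; q < TOY_NQS; q++)
		if (rq->nonempty & ((uint64_t)1 << q))
			return (q);
	return (-1);
}
```

Because the queue is passed in as a parameter, the same functions would serve a global queue or one queue per CPU, which is the direction the message hints at.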
|
72200 |
09-Feb-2001 |
bmilekic |
Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
similarly, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case.
Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)
|
72012 |
04-Feb-2001 |
phk |
Another round of the <sys/queue.h> FOREACH transmogriffer.
Created with: sed(1) Reviewed by: md5(1)
|
71999 |
04-Feb-2001 |
phk |
Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details.
Created with: sed(1) Reviewed by: md5(1)
|
71998 |
04-Feb-2001 |
phk |
Use <sys/queue.h> macro API.
|
71820 |
30-Jan-2001 |
dillon |
Fix a race between the syncer and umount. When you umount a softupdates filesystem softdep_process_worklist() is called in a loop until it indicates that no dependencies remain, but the determination of that fact depends on there only being one softdep_process_worklist() instance running. It was possible for the syncer to also be running softdep_process_worklist() and the pre-existing checks in the code to prevent this were not sufficient to prevent the race. This patch solves the problem.
Approved-by: mckusick
|
71576 |
24-Jan-2001 |
jasone |
Convert all simplelocks to mutexes and remove the simplelock implementations.
|
71073 |
15-Jan-2001 |
iedowse |
The ffs superblock includes a 128-byte region for use by temporary in-core pointers to summary information. An array in this region (fs_csp) could overflow on filesystems with a very large number of cylinder groups (~16000 on i386 with 8k blocks). When this happens, other fields in the superblock get corrupted, and fsck refuses to check the filesystem.
Solve this problem by replacing the fs_csp array in 'struct fs' with a single pointer, and add padding to keep the length of the 128-byte region fixed. Update the kernel and userland utilities to use just this single pointer.
With this change, the kernel no longer makes use of the superblock fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c to indicate that these fields must be calculated for compatibility with older kernels.
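The "~16000" figure quoted above can be reproduced with a back-of-the-envelope calculation. The sketch assumes that each pointer slot addresses one filesystem block of summary records, that pointers are 4 bytes on i386, and that each per-cylinder-group summary record is 16 bytes; these sizes are assumptions for the illustration, not values quoted from the commit, and `toy_max_cylinder_groups` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Upper bound on the number of cylinder groups whose summary records can
 * be reached through an array of pointers stored in a fixed-size region.
 */
static size_t
toy_max_cylinder_groups(size_t region_bytes, size_t pointer_bytes,
    size_t block_bytes, size_t summary_bytes)
{
	size_t slots, per_block;

	assert(pointer_bytes > 0 && summary_bytes > 0);
	slots = region_bytes / pointer_bytes;		/* pointers that fit */
	per_block = block_bytes / summary_bytes;	/* records per block */
	return (slots * per_block);
}
```

With a 128-byte region, 4-byte pointers, 8192-byte blocks and 16-byte records this gives 32 * 512 = 16384 groups, and slightly fewer if part of the region is used for other pointers. Replacing the array with a single pointer removes the dependence on the region size altogether.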
Reviewed by: mckusick
|
70980 |
12-Jan-2001 |
mckusick |
Properly compute the size of the final block of superblock summary information.
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
|
70183 |
19-Dec-2000 |
mckusick |
Several small but important fixes for snapshots:
1) Be more tolerant of missing snapshot files by only trying to decrement their reference count if they are registered as active.
2) Fix for snapshots of filesystems with block sizes larger than 8K (from Ollivier Robert <roberto@eurocontrol.fr>).
3) Fix to avoid losing last block in snapshot file when calculating blocks that need to be copied (from Don Coleman <coleman@coleman.org>).
|
70182 |
19-Dec-2000 |
mckusick |
Get rid of spurious check in ffs_truncate for i_size == length which fails to set the modification time on the file. The same check a few lines later takes the correct action.
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
|
70132 |
17-Dec-2000 |
assar |
add a stub for softdep_slowdown so that it's possible to build the kernel without SOFTUPDATES
|
69974 |
13-Dec-2000 |
tanimura |
Do not race for the lock of an inode hash.
Reviewed by: jhb
|
69967 |
13-Dec-2000 |
mckusick |
Preventing runaway kernel soft updates memory, take three. Previously, the syncer process was the only process in the system that could process the soft updates background work list. If enough other processes were adding requests to that list, it would eventually grow without bound. Because some of the work list requests require vnodes to be locked, it was not generally safe to let random processes process the work list while they already held vnodes locked. By adding a flag to the work list queue processing function to indicate whether the calling process could safely lock vnodes, it becomes possible to co-opt other processes into helping out with the work list. Now when the worklist gets too large, other processes can safely help out by picking off those work requests that can be handled without locking a vnode, leaving only the small number of requests requiring a vnode lock for the syncer process. With this change, it appears possible to keep even the nastiest workloads under control.
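The co-opting scheme can be sketched as a single processing routine that takes a flag saying whether the caller may lock vnodes. Everything here is a simplified stand-in: `struct toy_work`, `toy_process_worklist` and the flat array are hypothetical and only model which requests each kind of caller is allowed to pick off.

```c
#include <assert.h>
#include <stddef.h>

struct toy_work {
	int needs_vnode_lock;	/* nonzero if handling requires a vnode lock */
	int done;		/* nonzero once the request has been handled */
};

/*
 * Handle every pending request the caller is allowed to handle.
 * Callers that may already hold vnode locks pass can_lock_vnodes = 0 and
 * skip requests that need one; the syncer passes 1 and handles the rest.
 * Returns the number of requests handled in this pass.
 */
static size_t
toy_process_worklist(struct toy_work *items, size_t n, int can_lock_vnodes)
{
	size_t i, handled = 0;

	assert(items != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (items[i].done)
			continue;
		if (items[i].needs_vnode_lock && !can_lock_vnodes)
			continue;	/* leave it for the syncer */
		items[i].done = 1;
		handled++;
	}
	return (handled);
}
```

A helper process shrinks the list without risking a lock-order problem, and only the requests that need a vnode lock remain for the syncer.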
Submitted by: Paul Saab <ps@yahoo-inc.com>
|
69781 |
08-Dec-2000 |
dwmalone |
Convert more malloc+bzero to malloc+M_ZERO.
Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
|
69774 |
08-Dec-2000 |
phk |
Staticize some malloc M_ instances.
|
68933 |
20-Nov-2000 |
mckusick |
More aggressively rate limit the growth of soft dependency structures in the face of multiple processes doing massive numbers of filesystem operations. While this patch will work in nearly all situations, there are still some perverse workloads that can overwhelm the system. Detecting and handling these perverse workloads will be the subject of another patch.
Reviewed by: Paul Saab <ps@yahoo-inc.com> Obtained from: Ethan Solomita <ethan@geocast.com>
|
68885 |
18-Nov-2000 |
dillon |
Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory situations prior to now.
The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation, but rather than block they continue to operate, then return resources to the memory pool instead of caching them or leaving them wired.
Code has been added to stall in a low-memory situation prior to a vnode being locked.
Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmetic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op.
In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather than continue to corrupt the filesystem. The problem does not occur very often. It is very hard to reproduce, and it may or may not be the cause of the corruption people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
|
68715 |
14-Nov-2000 |
mckusick |
When deleting a file, the ordering of events imposed by soft updates is to first write the deleted directory entry to disk, second write the zero'ed inode to disk, and finally to release the freed blocks and the inode back to the cylinder-group map. As this ordering requires two disk writes to occur which are normally spaced about 30 seconds apart (except when memory is under duress), it takes about a minute from the time that a file is deleted until its inode and data blocks show up in the cylinder-group map for reallocation. If a file has had only a brief lifetime (less than 30 seconds from creation to deletion), neither its inode nor its directory entry may have been written to disk. If its directory entry has not been written to disk, then we need not wait for that directory block to be written as the on-disk directory block does not reference the inode. Similarly, if the allocated inode has never been written to disk, we do not have to wait for it to be written back either as its on-disk representation is still zero'ed out. Thus, in the case of a short lived file, we can simply release the blocks and inode to the cylinder-group map immediately. As the inode and its blocks are released immediately, they are immediately available for other uses. If they are not released for a minute, then other inodes and blocks must be allocated for short lived files, cluttering up the vnode and buffer caches. The previous code was a bit too aggressive in trying to release the blocks and inode back to the cylinder-group map resulting in their being made available when in fact the inode on disk had not yet been zero'ed. This patch takes a more conservative approach to doing the release which avoids doing the release prematurely.
|
67106 |
14-Oct-2000 |
adrian |
Initial commit of IFS - an inode-namespaced FFS. Here is a short description:
How it works: --
Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.) I didn't see the need in duplicating all of sys/ufs/ffs to get this off the ground.
File creation is done through a special file - 'newfile'. When newfile is called, the system allocates and returns an inode. Note that newfile is done in a cloning fashion:
fd = open("newfile", O_CREAT|O_RDWR, 0644);
fstat(fd, &st);
printf("new file is %d\n", (int)st.st_ino);
Once you have created a file, you can open() and unlink() it by its returned inode number retrieved from the stat call, ie:
fd = open("5", O_RDWR);
The creation permissions depend entirely on whether you have write access to the root directory of the filesystem.
To get the list of currently allocated inodes, VOP_READDIR has been added which returns a directory listing of those currently allocated.
--
What this entails:
* patching conf/files and conf/options to include IFS as a new compile option (and since ifs depends upon FFS, include the FFS routines)
* An entry in i386/conf/NOTES indicating IFS exists and where to go for an explanation
* Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS routines require (ffs_mount() and ffs_reload())
* a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS routines. IFS replaces some of the vfsops, and a handful of vnops - most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR(). Any other directory operation is marked as invalid.
What this results in:
* an IFS partition's create permissions are controlled by the perm/ownership of the root mount point, just like a normal directory
* Each inode has perm and ownership too
* IFS does *NOT* mean an FFS partition can be opened per inode. This is a completely separate filesystem here
* Softupdates doesn't work with IFS, and really I don't think it needs it. Besides, fsck's are FAST. (Try it :-)
* Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC). Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against this particular inode, and unravelling THAT code isn't trivial. Therefore, useful inodes start at 3.
Enjoy, and feedback is definitely appreciated!
|
66886 |
09-Oct-2000 |
eivind |
Blow away the v_specmountpoint define, replacing it with what it was defined as (rdev->si_mountpoint)
|
66753 |
06-Oct-2000 |
rwatson |
o Move initialization of ump from mp to the top of the function so that it is defined when used in ufs_extattr_uepm_destroy(), fixing a panic due to a NULL pointer dereference.
Submitted by: Wesley Morgan <morganw@chemicals.tacorp.com>
|
66617 |
04-Oct-2000 |
rwatson |
o Add call to ufs_extattr_uepm_destroy() in ffs_unmount() so as to clean up lock on extattrs.
o Get for free a comment indicating where auto-starting of extended attributes will eventually occur, as it was in my commit tree also. No implementation change here, only a comment.
|
66615 |
04-Oct-2000 |
jasone |
Convert lockmgr locks from using simple locks to using mutexes.
Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
|
66355 |
25-Sep-2000 |
bp |
Add a lock structure to the vnode structure. Previously it was either allocated separately (nfs, cd9660 etc) or kept as the first element of the structure referenced by the v_data pointer (ffs). Such organization leads to known problems with stacked filesystems.
From this point vop_no*lock*() functions maintain only interlock lock. vop_std*lock*() functions maintain built-in v_lock structure using lockmgr(). vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared lock on vnode.
If filesystem wishes to export lockmgr compatible lock, it can put an address of this lock to v_vnlock field. This indicates that the upper filesystem can take advantage of it and use single lock structure for entire (or part) of stack of vnodes. This field shouldn't be examined or modified by VFS code except for initialization purposes.
Reviewed in general by: mckusick
|
66187 |
21-Sep-2000 |
rwatson |
o Permit UFS Extended Attributes to be associated with special devices and FIFOs.
Obtained from: TrustedBSD Project
|
65998 |
17-Sep-2000 |
des |
Silence a warning.
|
65595 |
07-Sep-2000 |
mckusick |
Cannot do MALLOC with M_WAITOK while holding ACQUIRE_LOCK
Obtained from: Ethan Solomita <ethan@geocast.com>
|
65557 |
07-Sep-2000 |
jasone |
Major update to the way synchronization is done in the kernel. Highlights include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be preempted (i386 only).
Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
|
64437 |
09-Aug-2000 |
tegge |
Initialize *countp to 0 in stub for softdep_flushworklist(). This allows ffs_fsync() to break out of a loop that might otherwise be infinite on kernels compiled without the SOFTUPDATES option. The observed symptom was a system hang at the first unmount attempt.
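The failure mode is the usual one for an output parameter that a stub forgets to set. The sketch below is a hypothetical userland model: `toy_flushworklist_stub` and `toy_fsync_passes` are made-up names, and the `max_passes` bound exists only so the example always terminates.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stub used when the feature is compiled out: there is never anything to
 * flush, and the caller must be told so through *countp.
 */
static int
toy_flushworklist_stub(int *countp)
{
	assert(countp != NULL);
	*countp = 0;	/* without this the caller reads an indeterminate value */
	return (0);
}

/*
 * Caller in the style the message describes: keep flushing until a pass
 * reports that nothing was flushed. Returns the number of passes made.
 */
static int
toy_fsync_passes(int (*flush)(int *), int max_passes)
{
	int count, passes = 0;

	do {
		if (flush(&count) != 0)
			break;		/* error: count is not examined */
		passes++;
	} while (count > 0 && passes < max_passes);
	return (passes);
}
```

With the stub setting the count to zero, the loop exits after a single pass; if the stub left the count untouched, the loop condition would depend on whatever value happened to be in that variable.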
|
64104 |
01-Aug-2000 |
roberto |
Fix the lockmgr panic everyone is seeing at shutdown time. vput assumes curproc is the lock holder, but it's not true in this case.
Thanks a lot Luoqi !
Submitted by: luoqi Tested by: phk
|
63975 |
28-Jul-2000 |
peter |
Minor change: fix warning - move a 'struct vnode *vp' declaration inside a #ifdef DIAGNOSTIC to match its corresponding usage.
|
63897 |
26-Jul-2000 |
mckusick |
Clean up the snapshot code so that it no longer depends on the use of the SF_IMMUTABLE flag to prevent writing. Instead put in explicit checking for the SF_SNAPSHOT flag in the appropriate places. With this change, it is now possible to rename and link to snapshot files. It is also possible to set or clear any of the owner, group, or other read bits on the file, though none of the write or execute bits can be set. There is also an explicit test to prevent the setting or clearing of the SF_SNAPSHOT flag via chflags() or fchflags(). Note also that the modify time cannot be changed as it needs to accurately reflect the time that the snapshot was taken.
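The permission rule for snapshot files stated above (read bits may change, write and execute bits may not be set) can be written as a mask test. This is a sketch, not the kernel check: `toy_snapshot_mode_ok` is a hypothetical helper and it looks only at the nine ordinary permission bits plus the special bits passed in.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Return nonzero if the requested mode is acceptable for a snapshot file:
 * any combination of read bits, but no write or execute bits.
 */
static int
toy_snapshot_mode_ok(mode_t mode)
{
	const mode_t forbidden = S_IWUSR | S_IWGRP | S_IWOTH |
	    S_IXUSR | S_IXGRP | S_IXOTH;

	/* Only permission and special bits are expected here. */
	assert((mode & ~(mode_t)07777) == 0);
	return ((mode & forbidden) == 0);
}
```

Modes such as 0444 or 0400 pass the check, while 0644 or 0555 are refused because they include a write or an execute bit.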
Submitted by: Robert Watson <rwatson@FreeBSD.org>
|
63829 |
25-Jul-2000 |
mckusick |
Add stub for softdep_flushworklist() so that kernels compiled without the SOFTUPDATES option will load correctly.
Obtained from: John Baldwin <jhb@bsdi.com>
|
63788 |
24-Jul-2000 |
mckusick |
This patch corrects the first round of panics and hangs reported with the new snapshot code.
Update addaliasu to correctly implement the semantics of the old checkalias function. When a device vnode first comes into existence, check to see if an anonymous vnode for the same device was created at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than creating a new vnode for the device. This corrects a problem which caused the kernel to panic when taking a snapshot of the root filesystem.
Change the calling convention of vn_write_suspend_wait() to be the same as vn_start_write().
Split out softdep_flushworklist() from softdep_flushfiles() so that it can be used to clear the work queue when suspending filesystem operations.
Access to buffers becomes recursive so that snapshots can recursively traverse their indirect blocks using ffs_copyonwrite() when checking for the need for copy on write when flushing one of their own indirect blocks. This eliminates a deadlock between the syncer daemon and a process taking a snapshot.
Ensure that softdep_process_worklist() can never block because of a snapshot being taken. This eliminates a problem with buffer starvation.
Cleanup change in ffs_sync() which did not synchronously wait when MNT_WAIT was specified. The result was an unclean filesystem panic when doing forcible unmount with heavy filesystem I/O in progress.
Return a zero'ed block when reading a block that was not in use at the time that a snapshot was taken. Normally, these blocks should never be read. However, the readahead code will occasionally read them which can cause unexpected behavior.
Clean up the debugging code that ensures that no blocks be written on a filesystem while it is suspended. Snapshots must explicitly label the blocks that they are writing during the suspension so that they do not cause a `write on suspended filesystem' panic.
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to prevent a race condition that would permit the same block to be copied twice. This change eliminates an unexpected soft updates inconsistency in fsck caused by the double allocation.
Use bqrelse rather than brelse for buffers that will be needed soon again by the snapshot code. This improves snapshot performance.
|
63059 |
13-Jul-2000 |
bp |
Prevent possible dereference of NULL pointer.
Submitted by: Marius Bendiksen <mbendiks@eunet.no>
|
62985 |
12-Jul-2000 |
mckusick |
Brain fault, forgot to update ffs_snapshot.c with the new calling convention for vn_start_write.
|
62976 |
11-Jul-2000 |
mckusick |
Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed.
Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot, so for brevity and timeliness they are not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
|
62968 |
11-Jul-2000 |
mckusick |
Clean up warning about undeclared function by declaring softdep_fsync in mount.h instead of ffs_extern.h. The correct solution is to use an indirect function pointer so that the kernel does not have to be built with options FFS, but that will be left for another day.
|
62799 |
08-Jul-2000 |
mckusick |
Delete README as it is now obsolete. Relevant information is in README.softupdates.
|
62798 |
08-Jul-2000 |
mckusick |
Update to reflect current status.
|
62553 |
04-Jul-2000 |
mckusick |
Get userland visible flags added for snapshots to give a few days advance preparation for them to get migrated into place so that subsequent changes in utilities will not fail to compile for lack of up-to-date header files in /usr/include.
|
62469 |
03-Jul-2000 |
phk |
Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES, that is way cleaner than using the softupdates_stub stunt, which should be killed when convenient.
Discussed with: mckusick
|
62033 |
24-Jun-2000 |
ache |
Remove obsoleted info about linking from contrib
|
61926 |
22-Jun-2000 |
mckusick |
Update to new copyright.
|
61813 |
18-Jun-2000 |
mckusick |
When running with quotas enabled on a filesystem using soft updates, the system would panic when a user's inode quota was exceeded (see PR 18959 for details). This fixes that problem.
PR: 18959 Submitted by: Jason Godsey <jason@unixguy.fidalgo.net>
|
61812 |
18-Jun-2000 |
mckusick |
Some additional performance improvements. When freeing an inode check to see if it has been committed to disk. If it has never been written, it can be freed immediately. For short lived files this change allows the same inode to be reused repeatedly. Similarly, when upgrading a fragment to a larger size, if it has never been claimed by an inode on disk, it too can be freed immediately making it available for reuse often in the next slowly growing block of the same file.
|
61730 |
16-Jun-2000 |
phk |
Revert part of my bioops change which implemented panic(8).
|
61729 |
16-Jun-2000 |
phk |
ARGH! I have too many source trees :-(
Fix prototype errors in last commit.
|
61724 |
16-Jun-2000 |
phk |
Virtualizes & untangles the bioops operations vector.
Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
|
61698 |
14-Jun-2000 |
phk |
Remove a comment which should never have made it in.
|
61237 |
04-Jun-2000 |
rwatson |
o If FFS_EXTATTR is defined, don't print out an error message on unmount if an FFS partition returns EOPNOTSUPP, as it just means extended attributes weren't enabled on that partition. Prevents spurious warning per-partition at shutdown.
|
60938 |
26-May-2000 |
jake |
Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen.
Requested by: msmith and others
|
60833 |
23-May-2000 |
jake |
Change the way that the queue(3) structures are declared; don't assume that the type argument to *_HEAD and *_ENTRY is a struct.
Suggested by: phk Reviewed by: phk Approved by: mdodd
|
60165 |
07-May-2000 |
rwatson |
s/ffs_unmonut/ffs_unmount/ in a gratuitous ufs_extattr printf.
Reported by: knu
|
60041 |
05-May-2000 |
phk |
Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bdes teachings on the subject of nested includes.
Disk drivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.h> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
|
59794 |
30-Apr-2000 |
phk |
Remove unneeded #include <vm/vm_zone.h>
Generated by: src/tools/tools/kerninclude
|
59762 |
29-Apr-2000 |
phk |
s/biowait/bufwait/g
Prodded by: several.
|
59760 |
29-Apr-2000 |
phk |
Remove unneeded #include <sys/kernel.h>
|
59391 |
19-Apr-2000 |
phk |
Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>
|
59241 |
15-Apr-2000 |
rwatson |
Introduce extended attribute support for FFS, allowing arbitrary (name, value) pairs to be associated with inodes. This support is used for ACLs, MAC labels, and Capabilities in the TrustedBSD security extensions, which are currently under development.
In this implementation, attributes are backed to data vnodes in the style of the quota support in FFS. Support for FFS extended attributes may be enabled using the FFS_EXTATTR kernel option (disabled by default). Userland utilities and man pages will be committed in the next batch. VFS interfaces and man pages have been in the repo since 4.0-RELEASE and are unchanged.
o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR
o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h (This should not be the case, and will be fixed in a future commit)
Currently attributes are not supported in MFS. This will be fixed.
Reviewed by: adrian, bp, freebsd-fs, other unthanked souls Obtained from: TrustedBSD Project
|
58934 |
02-Apr-2000 |
phk |
Move B_ERROR flag to b_ioflags and call it BIO_ERROR.
(Much of this done by script)
Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.
Move b_pblkno and b_iodone_chain to struct bio while we transition, they will be obsoleted once bio structs chain/stack.
Add bio_queue field for struct bio aware disksort.
Address a lot of stylistic issues brought up by bde.
|
58349 |
20-Mar-2000 |
phk |
Rename the existing BUF_STRATEGY() to DEV_STRATEGY()
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)
substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)
This patch is machine generated except for the ccd.c and buf.h parts.
|
58345 |
20-Mar-2000 |
phk |
Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new field in struct buf: b_iocmd. The b_iocmd is enforced to have exactly one bit set.
B_WRITE was bogusly defined as zero giving rise to obvious coding mistakes.
Also eliminate the redundant struct buf flag B_CALL, it can just as efficiently be done by comparing b_iodone to NULL.
Should you get a panic or drop into the debugger, complaining about "b_iocmd", don't continue. It is likely to write on your disk where it should have been reading.
This change is a step in the direction towards a stackable BIO capability.
A lot of this patch was machine generated (Thanks to style(9) compliance!)
Vinum users: Greg has not had time to test this yet, be careful.
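A minimal sketch of the two checks described in this entry: a command field that must have exactly one bit set, and a completion callback tested against NULL instead of a separate B_CALL flag. The names `IOCMD_*`, `struct xbuf`, `iocmd_valid()` and `wants_callback()` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command bits; none of them is zero, unlike the old B_WRITE. */
enum {
	IOCMD_READ  = 0x1,
	IOCMD_WRITE = 0x2,
	IOCMD_FREE  = 0x4
};

/* Hypothetical, stripped-down stand-in for struct buf. */
struct xbuf {
	unsigned int iocmd;              /* must hold exactly one IOCMD_* bit */
	void (*iodone)(struct xbuf *);   /* non-NULL means "call me when done" */
};

/* True when exactly one bit is set: nonzero and a power of two. */
static bool
iocmd_valid(unsigned int cmd)
{
	return (cmd != 0 && (cmd & (cmd - 1)) == 0);
}

/* Replaces a separate "has callback" flag with a NULL comparison. */
static bool
wants_callback(const struct xbuf *bp)
{
	assert(bp != NULL);
	return (bp->iodone != NULL);
}
```

With a zero-valued write command, a test such as `flags & B_WRITE` can never be true, which is presumably the kind of coding mistake the entry refers to; giving every command its own nonzero bit makes such a test meaningful.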
|
58155 |
17-Mar-2000 |
mckusick |
Use 64-bit math to calculate if we have hit our freespace limit. Necessary for coherent results on filesystems bigger than 0.5Tb.
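A rough illustration of why the wider arithmetic matters, assuming the reserve is computed as "size in fragments times a percentage, divided by 100". The function names and the figures in the usage note are hypothetical; the 32-bit variant uses unsigned wraparound so its behaviour is defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit product: wraps once dsize * percent exceeds 2^32 - 1. */
static uint32_t
reserved_frags_32(uint32_t dsize, uint32_t percent)
{
	return (dsize * percent / 100);
}

/* 64-bit product: widen one operand before multiplying. */
static int64_t
reserved_frags_64(int32_t dsize, int32_t percent)
{
	assert(dsize >= 0 && percent >= 0 && percent <= 100);
	return ((int64_t)dsize * percent / 100);
}
```

For a hypothetical filesystem of 600000000 fragments with an 8% reserve, the 64-bit version yields 48000000 fragments, while the 32-bit product has already wrapped and gives a much smaller, wrong answer.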
|
58087 |
15-Mar-2000 |
mckusick |
Use 64-bit math to decide if optimization needs to be changed. Necessary for coherent results on filesystems bigger than 0.5Tb.
Submitted by: Paul Saab <ps@yahoo-inc.com>
|
57446 |
24-Feb-2000 |
dillon |
Fix a 'freeing free block' panic in UFS. The problem occurs when the filesystem fills up. If the first indirect block exists and FFS is able to allocate deeper indirect blocks, but is not able to allocate the data block, FFS improperly unwinds the indirect blocks and leaves a block pointer hanging to a freed block. This will cause a panic later when the file is removed. The solution is to properly account for the first block-pointer-to-an-indirect-block we had to create in a balloc operation and then unwind it if a failure occurs.
Detective work by: Ian Dowse <iedowse@maths.tcd.ie> Reviewed by: mckusick, Ian Dowse <iedowse@maths.tcd.ie> Approved by: jkh
|
56908 |
30-Jan-2000 |
mckusick |
When writing out bitmap buffers, need to skip over ones that already have a write in progress. Otherwise one can get in an infinite loop trying to get them all flushed.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
|
56209 |
18-Jan-2000 |
mckusick |
During fastpath processing for removal of a short-lived inode, the set of restrictions for cancelling an inode dependency (inodedep) is somewhat stronger than originally coded. Since this check appears in two places, we codify it into the function check_inode_unwritten which we then call from the two sites, one freeing blocks and the other freeing directory entries.
Submitted by: Steinar Haug via Matthew Dillon
|
56208 |
18-Jan-2000 |
mckusick |
Need to reorganize the flushing of directory entry (pagedep) dependencies so that they never try to lock an inode corresponding to ".." as this can lead to deadlock. We observe that any inode with an updated link count is always pushed into its buffer at the time of the link count change, so we do not need to do a VOP_UPDATE, but merely find its buffer and write it. The only time we need to get the inode itself is from the result of a mkdir whose name will never be ".." and hence locking such an inode will never request a lock above us in the filesystem tree. Thanks to Brian Fundakowski Feldman for providing the test program that tickled soft updates into hanging in "inode" sleep.
Submitted by: Brian Fundakowski Feldman <green@FreeBSD.org>
|
56150 |
17-Jan-2000 |
mckusick |
Better bounding on softdep_flushfiles; other minor tweaks to checks.
|
56149 |
17-Jan-2000 |
mckusick |
Must track multiple uncommitted renames until one ultimately gets committed to disk or is removed.
|
55947 |
14-Jan-2000 |
dillon |
Non-operational change, fix compiler warning.
Reviewed by: mckusick
|
55931 |
13-Jan-2000 |
mckusick |
Confirming Peter's fix (locking 101: release the lock before you go to sleep). Locking 101, part 2: do not look at buffer contents after you have been asleep. There is no telling what wondrous changes may have occurred.
|
55928 |
13-Jan-2000 |
peter |
Free the global softupdates lock prior to tsleep() in getdirtybuf(). This seems to be responsible for a bunch of panics where the process sleeps and something else finds softupdates "locked" when it shouldn't be. This commit is unreviewed, but has been a big help here. Previously my boxes would panic pretty much on the first fsync() that wrote something to disk.
|
55886 |
13-Jan-2000 |
mckusick |
Because cylinder group blocks are now written in background, it is no longer sufficient to get a lock on a buffer to know that its write has been completed. We have to first get the lock on the buffer, then check to see if it is doing a background write. If it is doing a background write, we have to wait for the background write to finish, then check to see if that fulfilled our dependency, and if not, start another write. Luckily the explanation is longer than the fix.
|
55885 |
13-Jan-2000 |
mckusick |
A panic occurs during an fsync when a dirty block associated with a vnode has not been written (which would clear certain of its dependencies). The problem arises because fsync with MNT_NOWAIT no longer pushes all the dirty blocks associated with a vnode. It skips those that require rollbacks, since they will just get instantly dirty again. Such skipped blocks are marked so that they will not be skipped a second time (otherwise circular dependencies would never clear). So, we fsync twice to ensure that everything will be written at least once.
|
55799 |
11-Jan-2000 |
mckusick |
The only known cause of this panic is running out of disk space. The problem occurs when an indirect block and a data block are being allocated at the same time. For example when the 13th block of the file is written, the filesystem needs to allocate the first indirect block and a data block. If the indirect block allocation succeeds, but the data block allocation fails, the error code deallocates the indirect block as it has nothing at which to point. Unfortunately, it does not deallocate the indirect block's associated dependencies which then fail when they find the block unexpectedly gone (ptr == 0 instead of its expected value). The fix is to fsync the file before doing the block rollback, as the fsync will flush out all of the dependencies. Once the rollback is done the file must be fsync'ed again so that the soft updates code does not find unexpected changes. This approach is much slower than writing the code to back out the extraneous dependencies, but running out of disk space is not expected to be a common occurrence, so just getting it right is the main criterion.
PR: kern/15063 Submitted by: Assar Westerlund <assar@stacken.kth.se>
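The "13th block" case above can be sketched as a lookup from logical block number to indirection level. This is an illustration only: it assumes 12 direct pointers and 3 indirect levels per inode (consistent with the 13th block being the first to need an indirect block), and `indir_level()`, `X_NDADDR` and `X_NIADDR` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define X_NDADDR 12	/* assumed number of direct block pointers */
#define X_NIADDR 3	/* assumed number of indirection levels */

/*
 * Return 0 for a direct block, 1..X_NIADDR for the level of indirection
 * needed to reach logical block lbn, or -1 if lbn is out of range.
 * nindir is the number of block pointers that fit in one indirect block.
 */
static int
indir_level(int64_t lbn, int64_t nindir)
{
	int64_t span = nindir;
	int level;

	assert(lbn >= 0 && nindir > 0);
	if (lbn < X_NDADDR)
		return (0);
	lbn -= X_NDADDR;
	for (level = 1; level <= X_NIADDR; level++) {
		if (lbn < span)
			return (level);
		lbn -= span;
		span *= nindir;
	}
	return (-1);
}
```

Logical block 11 (the 12th block) is still direct, while block 12 (the 13th) is the first that needs both a new indirect block and a data block, which is the double allocation that can fail halfway.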
|
55794 |
11-Jan-2000 |
mckusick |
We cannot proceed to free the blocks of the file until the dependencies have been cleaned up by deallocate_dependencies(). Once that is done, it is safe to post the request to free the blocks. A similar change is also needed for the freefile case.
|
55756 |
10-Jan-2000 |
phk |
Give vn_isdisk() a second argument where it can return a suitable errno.
Suggested by: bde
|
55726 |
10-Jan-2000 |
mckusick |
Missing FREE_LOCK call before handle_workitem_freeblocks.
Submitted by: "Kenneth D. Merry" <ken@kdm.org>
|
55697 |
10-Jan-2000 |
mckusick |
Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it was so recently created that its inode has not yet been written to disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the bitmap is being written to disk. To avoid these stalls, the bitmap is copied to another buffer which is written thus leaving the original available for further allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and i_nlink so that inodes that have had no change other than i_effnlink need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing daemon can choose to skip over them.
|
55694 |
09-Jan-2000 |
mckusick |
Keep tighter control of removal dependencies by limiting the number of dirrem structures rather than the collaterally created freeblks and freefile structures. Limit the rate of buffer dirtying by the syncer process during periods of intense file removal.
|
55692 |
09-Jan-2000 |
mckusick |
Reorganize softdep_fsync so that it only does the inode-is-flushed check before the inode is unlocked while grabbing its parent directory. Once it is unlocked, other operations may slip in that could make the inode-is-flushed check fail. Allowing other writes to the inode before returning from fsync does not break the semantics of fsync since we have flushed everything that was dirty at the time of the fsync call.
|
55691 |
09-Jan-2000 |
mckusick |
Get rid of unreferenced function.
|
55690 |
09-Jan-2000 |
mckusick |
Make static non-exported functions from soft updates.
|
55206 |
29-Dec-1999 |
peter |
Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL" is an application space macro and the applications are supposed to be free to use it as they please (but cannot). This is consistent with the other BSDs who made this change quite some time ago. More commits to come.
|
55029 |
23-Dec-1999 |
bde |
Update the unclean flag for mount -u. I forgot to handle this case when I made the absence of the clean flag sticky in rev.1.88. This was a problem mainly for "mount /". There is no way to mount "/" for writing without using mount -u (normally implicitly), so after "mount -f /" of an unclean filesystem, the absence of the clean flag was sticky forever.
|
54952 |
21-Dec-1999 |
eivind |
Change incorrect NULLs to 0s
|
54803 |
19-Dec-1999 |
rwatson |
Second pass commit to introduce new ACL and Extended Attribute system calls, vnops, vfsops, both in /kern, and to individual file systems that require a vfsop_ array entry.
Reviewed by: eivind
|
54700 |
16-Dec-1999 |
mckusick |
The function request_cleanup() had a tsleep() with PCATCH. It is quite dangerous, since the process may hold locks at the point, and if it is stopped in that tsleep the machine may hang. Because the sleep is so short, the PCATCH is not required here, so it has been removed. For the future, the FreeBSD team needs to decide whether it is still reasonable to stop a process in tsleep, as that may affect any other code that uses PCATCH while holding kernel locks.
Submitted by: Dmitrij Tejblum <tejblum@arc.hq.cti.ru> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
|
54655 |
15-Dec-1999 |
eivind |
Introduce NDFREE (and remove VOP_ABORTOP)
|
54444 |
11-Dec-1999 |
eivind |
Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() get a new process argument and a new return value: LK_EXCLOTHER, when the lock is held exclusively by another process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them.
* Extend the vnode_if.src format to allow more exact specification than locked/unlocked.
This commit should not do any semantic changes unless you are using DEBUG_VFS_LOCKS.
Discussed with: grog, mch, peter, phk Reviewed by: peter
|
54049 |
03-Dec-1999 |
billf |
Remove the 'alpha, use at your own risk' death-statement.
Reviewed by: mckusick (verbally at FreeBSDcon)
|
54048 |
03-Dec-1999 |
billf |
Fix typo, add $FreeBSD$
|
53996 |
01-Dec-1999 |
mckusick |
Preferentially allocate the first indirect block in the same cylinder group as the inode. This makes a 15% difference in read speed for files in the 96K to 500K size range.
|
53577 |
22-Nov-1999 |
phk |
Convert various pieces of code to use vn_isdisk() rather than checking for vp->v_type == VBLK.
In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.
Reviewed by: sos
|
53464 |
20-Nov-1999 |
eivind |
We do not have ffs_checkexp, so remove the prototype
|
53452 |
20-Nov-1999 |
phk |
struct mountlist and struct mount.mnt_list have no business being a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.
This removes ugly mp != (void*)&mountlist comparisons.
Requested by: phk Submitted by: Jake Burkholder jake@checker.org PR: 14967
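A small sketch of the traversal style a tail queue allows, using the TAILQ macros from <sys/queue.h>. The `struct xmount`, `xmountlist` and `count_mounts()` names are hypothetical and not the kernel's mount structures.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element standing in for a mount entry. */
struct xmount {
	int id;
	TAILQ_ENTRY(xmount) link;
};
TAILQ_HEAD(xmountlist, xmount);

/* Count the entries; the loop ends on NULL, not on the head's address. */
static int
count_mounts(struct xmountlist *head)
{
	struct xmount *mp;
	int n = 0;

	assert(head != NULL);
	TAILQ_FOREACH(mp, head, link)
		n++;
	return (n);
}
```

With a circular queue the end of the list is detected by comparing the element pointer with the (cast) address of the list head; with a tail queue the same loop simply stops at NULL.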
|
53059 |
09-Nov-1999 |
phk |
Next step in the device cleanup process.
Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.
Unify spec_open() for bdev and cdev cases.
Remove the disabled bdev specific read/write code.
|
52838 |
03-Nov-1999 |
bde |
Quick fix for breakage of ext2fs link counts as reported by stat(2) by the soft updates changes: only report the link count to be i_effnlink in ufs_getattr() for file systems that maintain i_effnlink.
Tested by: Mike Dracopoulos <mdraco@math.uoa.gr>
|
52782 |
01-Nov-1999 |
msmith |
Newline-terminate the complaint message about not being able to find the root vnode pointer.
|
52635 |
29-Oct-1999 |
phk |
useracc() the prequel:
Merge the contents (less some trivial, bordering on silly, comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs.
This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE} as argument.
|
51808 |
30-Sep-1999 |
phk |
Remove the D_NOCLUSTER[RW] options which were added because vn had problems. Now that Matt has fixed vn, this can go. The vn driver should have used d_maxio (now si_iosize_max) anyway.
|
51797 |
29-Sep-1999 |
phk |
Remove v_maxio from struct vnode.
Replace it with mnt_iosize_max in struct mount.
Nits from: bde
|
51138 |
11-Sep-1999 |
alfred |
Separate the export check in VFS_FHTOVP, exports are now checked via VFS_CHECKEXP.
Add fh(open|stat|statfs) syscalls to allow userland to query filesystems based on (network) filehandle.
Obtained from: NetBSD
|
50480 |
28-Aug-1999 |
peter |
$Id$ -> $FreeBSD$
|
50477 |
28-Aug-1999 |
peter |
$Id$ -> $FreeBSD$
|
50347 |
25-Aug-1999 |
phk |
Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.
|
50305 |
24-Aug-1999 |
sheldonh |
Fix bug introduced in rev 1.28, which causes kernel build to break for the case where DEBUG is defined but not DIAGNOSTIC. ffs_checkblk is declared conditionally on DIAGNOSTIC, not DEBUG.
PR: 13314 Reviewed by: bde
|
50253 |
23-Aug-1999 |
bde |
Use devtoname() to print dev_t's instead of casting them to long or u_long for misprinting in %lx format.
|
49679 |
13-Aug-1999 |
phk |
The bdevsw() and cdevsw() are now identical, so kill the former.
|
49535 |
08-Aug-1999 |
phk |
Decommission miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>, a few lines into <sys/vnode.h>.
Add a few fields to struct specinfo, paving the way for the fun part.
|
48801 |
13-Jul-1999 |
mckusick |
Create the macro DOINGASYNC to check whether the MNT_ASYNC flag has been set for a mount point. Insert missing checks to ensure that all write operations are done asynchronously when the MNT_ASYNC option has been requested.
Submitted by: Craig A Soules <soules+@andrew.cmu.edu> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
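A minimal sketch of what such a macro could look like, assuming a vnode that points at its mount and a mount that carries a flag word. All names here (`Y_MNT_ASYNC`, `struct ymount`, `struct yvnode`, `Y_DOINGASYNC`, `write_is_async()`) and the flag value are hypothetical, not the committed definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define Y_MNT_ASYNC 0x40u	/* hypothetical flag value */

/* Hypothetical, stripped-down mount and vnode. */
struct ymount {
	unsigned int mnt_flag;
};
struct yvnode {
	struct ymount *v_mount;
};

/* One place to ask "was this mount point requested to be asynchronous?" */
#define Y_DOINGASYNC(vp) (((vp)->v_mount->mnt_flag & Y_MNT_ASYNC) != 0)

/* Example caller: decide between a delayed and a synchronous write. */
static bool
write_is_async(const struct yvnode *vp)
{
	assert(vp != NULL && vp->v_mount != NULL);
	return (Y_DOINGASYNC(vp));
}
```

Centralizing the test in one macro makes it easier to spot write paths that forgot to check the flag, which is the gap this entry describes closing.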
|
48759 |
11-Jul-1999 |
phk |
Use the fsid from the superblock, unless it looks bogus or has already been taken by some other filesystem.
|
48656 |
07-Jul-1999 |
roberto |
Add $Id$
Approved by: kirk
|
48536 |
03-Jul-1999 |
jdp |
Update pathnames for new location of soft-updates sources.
|
48334 |
29-Jun-1999 |
mckusick |
No longer need to set B_ASYNC flag since BUF_KERNPROC now unconditionally sets the identity of the buffer.
|
48276 |
27-Jun-1999 |
peter |
Keep the inlines for <sys/buf.h> happy..
|
48225 |
26-Jun-1999 |
mckusick |
Convert buffer locking from using the B_BUSY and B_WANTED flags to using lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
|
47995 |
18-Jun-1999 |
mckusick |
On our final pass through ffs_fsync, do all I/O synchronously so that we can find out if our flush is failing because of write errors. This change avoids a "flush failed" panic during unrecoverable disk errors.
|
47964 |
16-Jun-1999 |
mckusick |
Add a vnode argument to VOP_BWRITE to get rid of the last vnode operator special case. Delete special case code from vnode_if.sh, vnode_if.src, umap_vnops.c, and null_vnops.c.
|
47940 |
15-Jun-1999 |
mckusick |
Get rid of the global variable rushjob and replace it with a function in kern/vfs_subr.c named speedup_syncer() which handles the speedup request. Change the various clients of rushjob to use the new function.
|
47640 |
31-May-1999 |
phk |
Simplify cdevsw registration.
The cdevsw_add() function now finds the major number(s) in the struct cdevsw passed to it. cdevsw_add_generic() is no longer needed, cdevsw_add() does the same thing.
cdevsw_add() will print a message if the d_maj field looks bogus.
Remove nblkdev and nchrdev variables. Most places they were used bogusly. Instead check a dev_t for validity by seeing if devsw() or bdevsw() returns NULL.
Move bdevsw() and devsw() functions to kern/kern_conf.c
Bump __FreeBSD_version to 400006
This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions
if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.
I4b and vinum not changed. Patches emailed to authors. LINT probably broken until they catch up.
|
47381 |
22-May-1999 |
julian |
Cosmetic changes to make it compile without errors in gcc -Wall
|
47131 |
14-May-1999 |
mckusick |
Add a hook to ffs_fsync to allow soft updates to get first chance at doing a sync on the block device for the filesystem. That allows it to push the bitmap blocks before the inode blocks which greatly reduces the number of inode rollbacks that need to be done.
|
47085 |
12-May-1999 |
peter |
Try and fix a dev_t/major/minor etc nit.
|
46827 |
09-May-1999 |
mckusick |
Put back changes that might be causing trouble on Alpha.
|
46676 |
08-May-1999 |
phk |
I got tired of seeing all the cdevsw[major(foo)] all over the place.
Made a new (inline) function devsw(dev_t dev) and substituted it.
Changed the BDEV variant to this format as well: bdevsw(dev_t dev)
DEVFS will eventually benefit from this change too.
|
46635 |
07-May-1999 |
phk |
Continue where Julian left off in July 1998:
Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline) function.
Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention to the order of the cmaj/bmaj arguments!)
Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE (ditto!)
(Next step will be to convert all bdev dev_t's to cdev dev_t's before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
|
46618 |
07-May-1999 |
mckusick |
Whitespace cleanup.
|
46616 |
07-May-1999 |
mckusick |
Get rid of random debugging cruft; sync up with latest version.
|
46609 |
07-May-1999 |
mckusick |
Severe slowdowns have been reported when creating or removing many files at once on a filesystem running soft updates. The root of the problem is that soft updates limits the amount of memory that may be allocated to dependency structures so as to avoid hogging kernel memory. The original algorithm just waited for the disk I/O to catch up and reduce the number of dependencies. This new code takes a much more aggressive approach. Basically there are two resources that routinely hit the limit: inode dependencies during periods with a high file creation rate, and file and block removal dependencies during periods with a high file removal rate. I have attacked these problems from two fronts. When the inode dependency limits are reached, I pick a random inode dependency, UFS_UPDATE it together with all the other dirty inodes contained within its disk block and then write that disk block. This trick usually clears 5-50 inode dependencies in a single disk I/O. For block and file removal dependencies, I pick a random directory page that has at least one remove pending and VOP_FSYNC its directory. That releases all its removal dependencies to the work queue. To further hasten things along, I also immediately start the work queue process rather than waiting for its next one second scheduled run.
|
46568 |
06-May-1999 |
peter |
Add sufficient braces to keep egcs happy about potentially ambiguous if/else nesting.
|
46349 |
02-May-1999 |
alc |
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistency issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vice versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly.
There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
|
46124 |
27-Apr-1999 |
msmith |
Simplify the tunefs example, since tunefs uses getfsfile(). Lots of people complain about working out what device their filesystems are mounted on.
|
44398 |
02-Mar-1999 |
mckusick |
Reorganize locking to avoid holding the lock during calls to bdwrite and brelse (which may sleep in some systems).
Obtained from: Matthew Dillon <dillon@apollo.backplane.com>
|
44391 |
02-Mar-1999 |
mckusick |
When fsync'ing a file on a filesystem using soft updates, we first try to write all the dirty blocks. If some of those blocks have dependencies, they will be remarked dirty when the I/O completes. On systems with really fast I/O systems, it is possible to get in an infinite loop trying to flush the buffers, because the I/O finishes before we can get all the dirty buffers off the v_dirtyblkhd list and into the I/O queue. (The previous algorithm looped over the v_dirtyblkhd list writing out buffers until the list emptied.) So, now we mark each buffer that we try to write so that we can distinguish the ones that are being remarked dirty from those that we have not yet tried to flush. Once we have tried to push every buffer once, we then push any associated metadata that is causing the remaining buffers to be redirtied.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
|
44383 |
02-Mar-1999 |
mckusick |
Ensure that softdep_sync_metadata can handle bmsafemap and mkdir entries if they ever arise (which should not happen as softdep_sync_metadata is currently used).
|
44102 |
17-Feb-1999 |
mckusick |
fix double LIST_REMOVE; other cosmetic changes to match version 9.32. Obtained from: Jeffrey Hsu <hsu@FreeBSD.ORG>
|
43311 |
28-Jan-1999 |
dillon |
Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
|
43044 |
22-Jan-1999 |
dg |
Gutted softdep_deallocate_dependencies and replaced it with a panic. It turns out to not be useful to unwind the dependencies and continue in the face of a fatal error. Also changed the log() to a printf() in softdep_error() so that it will be output in the case of an impending panic. Submitted by: Kirk McKusick <mckusick@mckusick.com>
|
42567 |
12-Jan-1999 |
eivind |
Silence warning about unused debug function. (I'll turn this function into a DDB command in my next staticization sweep).
|
42400 |
08-Jan-1999 |
eivind |
Add a warning about the copyright restraints.
|
42374 |
07-Jan-1999 |
bde |
Don't pass unused timestamp args to UFS_UPDATE() or waste time initializing them. This almost finishes centralizing (in-core) timestamp updates in ufs_itimes().
|
42354 |
06-Jan-1999 |
bde |
UFS_UPDATE() takes a boolean `waitfor' arg, so don't pass it the value MNT_WAIT when we mean boolean `true' or check for that value not being passed. There was no problem in practice because MNT_WAIT had the magic value of 1.
|
42351 |
06-Jan-1999 |
bde |
Ifdefed the conditionally used variable `prtrealloc'. Declare it as volatile so that there is no chance that the code that it controls is optimised away.
|
42350 |
06-Jan-1999 |
bde |
Backed out rev.1.47. It just broke my optimisations for lazy syncing of timestamps in rev.1.45. The soft updates bug was elsewhere.
Forgotten by: luoqi
|
42315 |
05-Jan-1999 |
eivind |
Remove the 'waslocked' parameter to vfs_object_create().
|
42244 |
02-Jan-1999 |
eivind |
Remove the last clients of vfs_object_create(..., waslocked=1); waslocked will go away shortly.
Reviewed by: dg
|
41659 |
10-Dec-1998 |
julian |
Remove some compiler warnings.
|
41395 |
29-Nov-1998 |
bde |
Don't use the strange null pointer constant `(ufs_daddr_t)0' in a call to VOP_BMAP(). Don't use uncast NULLs in the same call.
|
41124 |
13-Nov-1998 |
dg |
Restored the "reallocblks" code to its former glory. What this does is basically do a on-the-fly defragmentation of the FFS filesystem, changing file block allocations to make them contiguous. Thanks to Kirk McKusick for providing hints on what needed to be done to get this working.
|
40791 |
31-Oct-1998 |
peter |
Change dirty block list handling to use TAILQ macros.
|
40790 |
31-Oct-1998 |
peter |
Use TAILQ macros for clean/dirty block list processing. Set b_xflags rather than abusing the list next pointer with a magic number.
|
40692 |
28-Oct-1998 |
jkh |
Clarify a rather ambiguous debugging message.
|
40672 |
27-Oct-1998 |
bde |
Oops, the redundant tests for major numbers weren't redundant here. They checked for the magic major number for the "device" behind mfs mount points. Use a more obvious check for this device.
Debugged by: Andrew Gallatin <gallatin@cs.duke.edu>
|
40649 |
25-Oct-1998 |
bde |
Don't follow null bdevsw pointers. The `major(dev) < nblkdev' test rotted when bdevsw[] became sparse. We still depend on magic to avoid having to check that (v_rdev) device numbers in vnodes are not NODEV.
Removed redundant `major(dev) < nblkdev' tests instead of updating them.
|
40648 |
25-Oct-1998 |
phk |
Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
|
39933 |
03-Oct-1998 |
nate |
Fix 'noatime' bug that was unrelated to use of noatime.
The problem is caused when a directory block is compacted. When this occurs, softdep_change_directoryentry_offset() is called to relocate each directory entry and adjust its matching diradd structure, if any, to match the new location of the entry. The bug is that while softdep_change_directoryentry_offset() correctly adjusts the offsets of the diradd structures on the pd_diraddhd[] lists (which are not yet ready to be committed to disk), it fails to adjust the offsets of the diradd structures on the pd_pendinghd list (which are ready to be committed to disk). This causes the dependency structures to be inconsistent with the buf contents. Now, if the compaction has moved a directory entry to the same offset as one of the diradd structures on the pd_pendinghd list *and* a syscall is done that tries to remove this directory entry before this directory block has been written to disk (which would empty pd_pendinghd), a sanity check in newdirrem() will call panic() when it notices that the inode number in the entry that is to be removed doesn't match the inode number in the diradd structure with the offset of that entry.
Reviewed by: Kirk McKusick <mckusick@McKusick.COM> Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>
|
39669 |
26-Sep-1998 |
bde |
Fixed clean flag handling:
- don't set the clean flag on unmount of an unclean filesystem that was (forcibly) mounted rw.
- set the clean flag on rw -> ro update of a mounted initially-clean filesystem.
- fixed some style bugs (mostly long lines).
This uses the fs_flags field and FS_UNCLEAN state bit which were introduced in the softdep changes. NetBSD uses extra state bits in fs_clean.
Reviewed by: luoqui
|
39623 |
24-Sep-1998 |
luoqi |
Eliminate a race in VOP_FSYNC() when softupdates is enabled.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Two minor changes are also included:
1. Remove gratuitous checks for error return from vn_lock with LK_RETRY set, vn_lock should always succeed in these cases.
2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount a little more unstable. It also keeps us in sync with other BSDs.
Suggested by: Bruce Evans <bde@zeta.org.au>
|
39281 |
15-Sep-1998 |
luoqi |
Restore pre-v1.44 behavior: always copy modified in-core inode to disk buffer. Otherwise some in-core inode changes might be lost, including important meta data (e.g. size) if softupdates is enabled.
|
39187 |
14-Sep-1998 |
sos |
Remove the SLICE code. This clearly needs a lot more thought, and we don't need this to hunt us down in 3.0-RELEASE.
|
39099 |
12-Sep-1998 |
bde |
Don't dereference an uninitialized pointer in dead code. The dead code gets executed if it is compiled without optimization.
|
38909 |
07-Sep-1998 |
bde |
Removed statically configured mount type numbers (MOUNT_*) and all references to them.
The change a couple of days ago to ignore these numbers in statically configured vfsconf structs was slightly premature because the cd9660, cfs, devfs, ext2fs, nfs vfs's still used MOUNT_* instead of the number in their vfsconf struct.
|
38907 |
07-Sep-1998 |
bde |
Put the zombie ffs sysctl node in "notyet" state together with its few remaining children. Prepare it for MOUNT_UFS going away.
|
38862 |
05-Sep-1998 |
phk |
Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform device drivers about sectors no longer in use.
Device-drivers receive the call through d_strategy, if they have D_CANFREE in d_flags.
This allows flash based devices to erase the sectors and avoid pointlessly carrying them around in compactions.
Reviewed by: Kirk McKusick, bde Sponsored by: M-Systems (www.m-sys.com)
|
38408 |
17-Aug-1998 |
bde |
Removed unused includes.
|
38291 |
12-Aug-1998 |
julian |
Handle the case of moving a directory onto the top of a sibling's child of the same name.
Submitted by: Kirk McKusick with fixes from Luoqi Chen Obtained from: Whistle test tree.
|
37555 |
11-Jul-1998 |
bde |
Fixed printf format errors.
|
37520 |
08-Jul-1998 |
julian |
Don't update superblock if mounted readonly, also fixes some problems with softupdates on root. More cleanups are needed here.. Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
|
37384 |
04-Jul-1998 |
julian |
VOP_STRATEGY grows a (struct vnode *) argument as the value in b_vp is often not really what you want (and needs to be frobbed). More cleanups will follow this. Reviewed by: Bruce Evans <bde@freebsd.org>
|
37363 |
03-Jul-1998 |
bde |
Sync timestamp changes for inodes of special files to disk as late as possible (when the inode is reclaimed). Temporarily only do this if option UFS_LAZYMOD configured and softupdates aren't enabled. UFS_LAZYMOD is intentionally left out of /sys/conf/options.
This is mainly to avoid almost useless disk i/o on battery powered machines. It's silly to write to disk (on the next sync or when the inode becomes inactive) just because someone hit a key or something wrote to the screen or /dev/null.
PR: 5577 Previous version reviewed by: phk
|
37362 |
03-Jul-1998 |
bde |
Centralized in-core inode update. Update the in-core inode directly in ufs_setattr() so that there is no need to pass timestamps to UFS_UPDATE() (everything else just needs the current time). Ignore the passed-in timestamps in UFS_UPDATE() and always call ufs_itimes() (was: itimes()) to do the update. The timestamps are still passed so that all the callers don't need to be changed yet.
|
37167 |
26-Jun-1998 |
jkh |
Flesh this document out just a little in response to some user questions and also recommend linking over copying since, at this stage, a stale copy is a real concern.
|
36990 |
14-Jun-1998 |
julian |
Slight change to directory cleanup. Makes soft updates a bit cleaner. Eliminates some warnings about 'corrupted directories' from fsck.
|
36936 |
12-Jun-1998 |
julian |
Note which version of Kirk's sources this corresponds to.
|
36935 |
12-Jun-1998 |
julian |
Fix the case when renaming to a file that you've just created and deleted, that had an inode that has not yet been written to disk, when the inode of the new file is also not yet written to disk, and your old directory entry is not yet on disk but you need to remove it and the new name exists in memory but has been deleted but the transaction to write the deleted name to disk exists and has not yet been cancelled by the request to delete the nonexistent name. I don't know how Kirk could have missed such a glaring problem for so long. :-) Especially since the inconsistency survived on the disk for a whole 4 seconds on average before being fixed by other code. This was not a crashing bug but just led to filesystem inconsistencies if you crashed.
Submitted by: Kirk McKusick (mckusick@mckusick.com)
|
36900 |
11-Jun-1998 |
julian |
Add B_NOCACHE to several cases where BSD4.4 only required a B_INVAL. Change worked out by john and kirk in consort.
|
36871 |
10-Jun-1998 |
julian |
Fix for "live inode" panic. Submitted by: Kirk McKusick <mckusick@McKusick.COM> Reviewed by: yeah right...
|
36866 |
10-Jun-1998 |
julian |
Remove buggy debugging code.
|
36863 |
10-Jun-1998 |
julian |
Back out John's changes 1.45 -> 1.46 Kirk confirms that the original semantic was what he wanted... (well, a very slight difference) May fix "dangling deps" panic with soft updates.
|
36646 |
04-Jun-1998 |
dfr |
Use size_t instead of u_int for sizes.
|
36581 |
02-Jun-1998 |
julian |
Add a reference to the original softupdates paper
|
36580 |
02-Jun-1998 |
julian |
Add a reference to the Ganger/Patt paper
|
36404 |
27-May-1998 |
julian |
A fix to a debug test from Kirk.
|
36235 |
19-May-1998 |
julian |
Ensure that there is enough information here, so that people can use soft updates should they desire.
|
36234 |
19-May-1998 |
julian |
Bring up-to-date with Whistle's current version. Includes some debugging code.
|
36232 |
19-May-1998 |
julian |
Merge with Kirk's version as of Feb 20
His version 9.23 == our version 1.5 of ffs_softdep.c
His version 9.5 == our version 1.4 of softdep.c
|
36225 |
19-May-1998 |
julian |
Merge in Kirk's changes to stop softupdates from hogging all of memory.
|
36212 |
19-May-1998 |
julian |
Change to stop a silly panic. This should be understood better. Change a buffer swizzle trick to a bcopy. It would be nice if the efficient trick could be used in the future.
|
36210 |
19-May-1998 |
julian |
First published FreeBSD version of soft updates Feb 5.
|
36207 |
19-May-1998 |
julian |
This commit was generated by cvs2svn to compensate for changes in r36206, which included commits to RCS files with non-trunk default branches.
|
36202 |
19-May-1998 |
julian |
This commit was generated by cvs2svn to compensate for changes in r36201, which included commits to RCS files with non-trunk default branches.
|
36147 |
18-May-1998 |
julian |
Try to stop the user from using mount -u to set the async flag on a filesystem currently using soft updates. Also needs a new copy of ffs_softdep.c to complete the fix.
|
35955 |
11-May-1998 |
julian |
Add missing splx()
Submitted by: Luoqi Chen <luoqi@chen.ml.org>
|
35769 |
06-May-1998 |
msmith |
As described by the submitter:
Reverse the VFS_VRELE patch. Reference counting of vnodes does not need to be done per-fs. I noticed this while fixing vfs layering violations. Doing reference counting in generic code is also the preference cited by John Heidemann in recent discussions with him.
The implementation of alternative vnode management per-fs is still a valid requirement for some filesystems but will be revisited sometime later, most likely using a different framework.
Submitted by: Michael Hancock <michaelh@cet.co.jp>
|
35696 |
04-May-1998 |
dyson |
Correct an error that I made where the vtruncbuf was changed back to vinvalbuf, but I incorrectly added the "V_SAVE|V_SAVEMETA" flags. Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
|
35526 |
30-Apr-1998 |
dyson |
Fix an error that I made with an optimization. In the case of softupdates, we need to do vtruncbuf the old way. Luoqi caught, found the bug and submitted this fix. Submitted by: Luoqi Chen <luoqi@chen.ml.org>
|
35323 |
20-Apr-1998 |
julian |
Make the devfs SLICE option a standard type option. (hopefully it will go away eventually anyhow)
|
35319 |
19-Apr-1998 |
julian |
Add changes and code to implement a functional DEVFS. This code will be turned on with the TWO options DEVFS and SLICE. (see LINT) Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will delineate these changes.
/dev will be automatically mounted by init (thanks phk) on bootup. See /sys/dev/slice/slice.4 for more info. All code should act the same without these options enabled.
Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others
This code does not support the following: bad144 handling. Persistence. (My head is still hurting from the last time we discussed this) ATAPI floppies are not handled by the SLICE code yet.
When this code is running, all major numbers are arbitrary and COULD be dynamically assigned. (this is not done, for POLA only) Minor numbers for disk slices ARE arbitrary and dynamically assigned.
|
34961 |
30-Mar-1998 |
phk |
Eradicate the variable "time" from the kernel, using various measures. "time" wasn't an atomic variable, so splfoo() protection was needed around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now use the new variable time_second instead.
gettime() changed to getmicrotime().
Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeared time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passed &time as args.
Change profiling in ncr.c to use ticks instead of time. Resolution is the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences.
Reviewed by: bde
|
34924 |
28-Mar-1998 |
bde |
Moved some #includes from <sys/param.h> nearer to where they are actually used.
|
34913 |
27-Mar-1998 |
peter |
Enable the use of soft updates on the root filesystem. Previously, the softdep mode could only be activated on the initial mount of a filesystem and then only if it was a read-write mount. A 'mount -r' (as done in the rootfs mount) followed by a 'mount -u' to convert to read-write didn't start softdep mode.
|
34901 |
26-Mar-1998 |
phk |
Add two new functions, get{micro|nano}time.
They are atomic, but return in essence what is in the "time" variable. gettime() is now a macro front for getmicrotime().
Various patches to use the two new functions instead of the various hacks used in their absence.
Some punctuation and grammar patches from Bruce.
A couple of XXX comments.
|
34826 |
23-Mar-1998 |
bde |
Forward declare even more structs to restore some self-sufficiency. Didn't fix new dependence on <ufs/ufs/inode.h> and its prerequisites.
|
34734 |
21-Mar-1998 |
dyson |
Softdep_sync_metadata appears to expect that it is called at splbio, so make it so...
|
34696 |
19-Mar-1998 |
dyson |
Fix vfs_bio_awrite usage, and correct vtruncbuf usage.
|
34611 |
16-Mar-1998 |
dyson |
Some VM improvements, including elimination of a lot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!!
pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.)
pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.)
vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust.
vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true.
vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.)
vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.)
vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliability and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.)
vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, make page invalidation change the object generation count so that we handle generation counts a little more robustly.
vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.)
vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
|
34266 |
08-Mar-1998 |
julian |
Reviewed by: dyson@freebsd.org (John Dyson), dg@root.com (David Greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: Whistle development tree
|
34248 |
08-Mar-1998 |
julian |
Submitted by: Kirk McKusick
Stub file for soft updates.
|
34206 |
07-Mar-1998 |
dyson |
This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifested themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistency, and as a side-effect broke the vfs.ioopt code. This code might have been committed separately, but almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitous buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean.
15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads.
16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle consistency problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify its status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
|
34184 |
07-Mar-1998 |
bde |
Fixed missing simple_lock() in ffs_mountfs().
|
33964 |
01-Mar-1998 |
msmith |
The intent is to get rid of WILLRELE in vnode_if.src by making a complement to all ops that return a vpp, VFS_VRELE. This is initially only for file systems that implement the following ops that do a WILLRELE:
vop_create, vop_whiteout, vop_mknod, vop_remove, vop_link, vop_rename, vop_mkdir, vop_rmdir, vop_symlink
This is initial DNA that doesn't do anything yet. VFS_VRELE is implemented but not called.
A default vfs_vrele was created for fs implementations that use the standard vnode management routines.
VFS_VRELE implementations were made for the following file systems:
Standard (vfs_vrele): ffs, mfs, nfs, msdosfs, devfs, ext2fs
Custom: union, umapfs
Just EOPNOTSUPP: fdesc, procfs, kernfs, portal, cd9660
These implementations may change as VOP changes are implemented.
In the next phase, in the vop implementations calls to vrele and the vrele part of vput will be moved to the top layer vfs_vnops and made visible to all layers. vput will be replaced by unlock in these cases. Unlocking will still be done in the per fs layer but the refcount decrement will be triggered at the top because it doesn't hurt to hold a vnode reference a little longer. This will have minimal impact on the structure of the existing code.
This will only be done for vnode arguments that are released by the various fs vop implementations.
Wider use of VFS_VRELE will likely require restructuring of the code.
Reviewed by: phk, dyson, terry et al. Submitted by: Michael Hancock <michaelh@cet.co.jp>
|
33847 |
26-Feb-1998 |
msmith |
In the author's words:
These diffs implement the first stage of a VOP_{GET|PUT}PAGES pushdown for local media FS's.
See ffs_putpages in /sys/ufs/ufs/ufs_readwrite.c for implementation details for generic *_{get|put}pages for local media FS's. Support is trivial to add for any FS that formerly relied on the default behaviour of the vnode_pager in EOPNOTSUPP cases (just copy the ffs_getpages() code for the FS in question's *_{get|put}pages).
Obviously, it would be better if each local media FS implemented a more optimal method, instead of calling an exported interface from the /sys/vm/vnode_pager.c, but this is a necessary first step in getting the FS's to a point where they can be supplied with better implementations on a case-by-case basis.
Obviously, the cd9660_putpages() can be rather trivial (since it is a read-only FS type 8-)).
A slight (temporary) modification is made to print a diagnostic message in the case where the underlying filesystem attempts to engage in the previous behaviour. Failure is likely to be ungraceful.
Submitted by: terry@freebsd.org (Terry Lambert)
|
33820 |
25-Feb-1998 |
bde |
Fixed missing permissions checking for mounting by non-root.
There is now less need for the vfs.usermount sysctl. msdosfs already has this change, modulo a missing LK_RETRY, via NetBSD. At least ext2fs is missing this and many other changes from Lite2.
Obtained from: Lite2
|
33289 |
13-Feb-1998 |
bde |
Removed unnecessary dependencies on KERNEL and DIAGNOSTIC. This was more useful when opt_diagnostic.h had to be included.
|
33181 |
09-Feb-1998 |
eivind |
Staticize.
|
33134 |
06-Feb-1998 |
eivind |
Back out DIAGNOSTIC changes.
|
33108 |
04-Feb-1998 |
eivind |
Turn DIAGNOSTIC into a new-style option.
|
33054 |
03-Feb-1998 |
bde |
Forward declare some structs so that this file is more self-sufficient.
|
32976 |
01-Feb-1998 |
dyson |
Back out recent laptop sync changes. They had significant errors.
|
32951 |
01-Feb-1998 |
dyson |
Support more intelligent sync operations for MNT_NOATIME. PR: kern/5577 Submitted by: Craig Leres <leres@ee.lbl.gov>
|
32702 |
22-Jan-1998 |
dyson |
VM level code cleanups.
1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneeded collapse operations, and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
|
32585 |
17-Jan-1998 |
dyson |
Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management.
The system should be much more stable, but all subtle bugs aren't fixed yet.
|
32286 |
06-Jan-1998 |
dyson |
Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does.
When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex.
Vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would greatly appreciate feedback as soon as reasonably possible.
|
32071 |
29-Dec-1997 |
dyson |
Lots of improvements, including restructuring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include:
1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.
|
31561 |
05-Dec-1997 |
bde |
Don't include <sys/lock.h> in headers when only `struct simplelock' is required. Fixed everything that depended on the pollution.
|
31484 |
02-Dec-1997 |
bde |
Fix a small style bug in the generation number change (rev.1.33) before copying the change to other fs's.
|
31352 |
22-Nov-1997 |
bde |
Staticized.
|
31351 |
22-Nov-1997 |
bde |
Unremoved prtrealloc and the declaration of ffs_clusteralloc(). These are used in the `#ifdef notyet' case :-). This case is used except in the `#if !defined (not_yes)' case :-|. This has something to do with the `#ifdef notyet_block_reallocation_enabled' case in vfs_cluster.c :-(.
|
31274 |
18-Nov-1997 |
bde |
Removed an unused #include in the `#ifdef KERNEL' case.
Fixed a comment to match the code. The code is still wrong (ffs_checkoverlap() should be staticized and called from a ddb command).
|
31132 |
12-Nov-1997 |
julian |
Reviewed by: various.
Ever since I first saw the way the mount flags were used I've hated the fact that modes, and events, internal and exported, and short-term and long term flags are all thrown together. Finally it's annoyed me enough.. This patch to the entire FreeBSD tree adds a second mount flag word to the mount struct. It is not exported to userspace. I have moved some of the non exported flags over to this word. This means that we now have 8 free bits in the mount flags. There are another two that might well move over, but which I'm not sure about. The only user visible change would have been in pstat -v, except that davidg has disabled it anyhow. I'd still like to move the state flags and the 'command' flags apart from each other.. e.g. MNT_FORCE really doesn't have the same semantics as MNT_RDONLY, but that's left for another day.
|
31016 |
07-Nov-1997 |
phk |
Remove a bunch of variables which were unused both in GENERIC and LINT.
Found by: -Wunused
|
30780 |
27-Oct-1997 |
bde |
Removed unused #includes. The need for most of them went away with recent changes (docluster* and vfs improvements).
|
30492 |
16-Oct-1997 |
phk |
Another VFS cleanup "kilo commit"
1. Remove VOP_UPDATE, it is (also) a UFS/{FFS,LFS,EXT2FS,MFS} interface function, and now lives in the ufsmount structure.
2. Remove VOP_SEEK, it was unused.
3. Add more default vops:
VOP_ADVLOCK      vop_einval
VOP_CLOSE        vop_null
VOP_FSYNC        vop_null
VOP_IOCTL        vop_enotty
VOP_MMAP         vop_einval
VOP_OPEN         vop_null
VOP_PATHCONF     vop_einval
VOP_READLINK     vop_einval
VOP_REALLOCBLKS  vop_eopnotsupp
And remove identical functionality from filesystems
4. Add vop_stdpathconf, which returns the canonical stuff. Use it in the filesystems. (XXX: It's probably wrong that specfs and fifofs sets this vop, shouldn't it come from the "host" filesystem, for instance ufs or cd9660 ?)
5. Try to make system wide VOP functions have vop_* names.
6. Initialize the um_* vectors in LFS.
(Recompile your LKMS!!!)
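The default-vop mapping listed in item 3 can be modelled roughly as follows. This is only an illustrative Python sketch, not kernel code: `DEFAULT_VOPS`, `vcall` and `examplefs_vnops` are hypothetical names, and falling back to `vop_eopnotsupp` for operations that are not in the list is an assumption made here.

```python
import errno

# Hypothetical stand-ins for the generic default handlers named in the log.
def vop_null(*args):
    return 0

def vop_einval(*args):
    return errno.EINVAL

def vop_enotty(*args):
    return errno.ENOTTY

def vop_eopnotsupp(*args):
    return errno.EOPNOTSUPP

# Defaults taken from item 3 of the commit message.
DEFAULT_VOPS = {
    "VOP_ADVLOCK": vop_einval,
    "VOP_CLOSE": vop_null,
    "VOP_FSYNC": vop_null,
    "VOP_IOCTL": vop_enotty,
    "VOP_MMAP": vop_einval,
    "VOP_OPEN": vop_null,
    "VOP_PATHCONF": vop_einval,
    "VOP_READLINK": vop_einval,
    "VOP_REALLOCBLKS": vop_eopnotsupp,
}

def vcall(fs_vnops, op, *args):
    """Use the filesystem's own entry if present, else the shared default.

    Assumption: operations with no listed default fail with EOPNOTSUPP.
    """
    handler = fs_vnops.get(op, DEFAULT_VOPS.get(op, vop_eopnotsupp))
    return handler(*args)

# A hypothetical filesystem that only overrides VOP_READLINK.
examplefs_vnops = {"VOP_READLINK": lambda *args: 0}
```

With a table like this, a filesystem lists only the operations it actually implements, and identical stubs no longer need to be repeated per filesystem.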
|
30474 |
16-Oct-1997 |
phk |
VFS mega cleanup commit (x/N)
1. Add new file "sys/kern/vfs_default.c" where default actions for VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE, POLL, REVOKE and STRATEGY. Various stuff spread over the entire tree belongs here.
2. Change VOP_BLKATOFF to a normal function in cd9660.
3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These are private interface functions between UFS and the underlying storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now live in struct ufsmount instead.
4. Remove a kludge of VOP_ functions in all filesystems, that did nothing but obscure the simplicity and break the expandability. If a filesystem doesn't implement VOP_FOO, it shouldn't have an entry for it in its vnops table. The system will try to DTRT if it is not implemented. There is still some cruft left, but the bulk of it is done.
5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
|
30439 |
15-Oct-1997 |
phk |
vnops megacommit
1. Use the default function to access all the specfs operations. 2. Use the default function to access all the fifofs operations. 3. Use the default function to access all the ufs operations. 4. Fix VCALL usage in vfs_cache.c 5. Use VOCALL to access specfs functions in devfs_vnops.c 6. Staticize most of the spec and fifofs vnops functions. 7. Make UFS panic if it lacks bits of the underlying storage handling.
|
30434 |
15-Oct-1997 |
phk |
Hmm, realign the vnops into two columns.
|
30431 |
15-Oct-1997 |
phk |
Stylistic overhaul of vnops tables. 1. Remove comment stating the blatantly obvious. 2. Align in two columns. 3. Sort all but the default element alphabetically. 4. Remove XXX comments pointing out entries not needed.
|
30418 |
14-Oct-1997 |
phk |
I think my previous change may have opened a race condition. This patch does the same thing, with no change in semantics.
|
30402 |
14-Oct-1997 |
phk |
ufs_ihashrem() should not be called from the UFS layer, but from the lower layer (LFS/FFS/?) like the rest of the ihash functions. Otherwise it is impossible to make a lower layer that doesn't use the ihash facility.
|
30354 |
12-Oct-1997 |
phk |
Last major round (Unless Bruce thinks of something :-) of malloc changes.
Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them.
A couple of finer points by: bde
|
30309 |
11-Oct-1997 |
phk |
Distribute and staticize a lot of the malloc M_* types.
Substantial input from: bde
|
30283 |
10-Oct-1997 |
phk |
Add type arg to ffs_mountfs and avoid examining v_tag to find out if MFS is getting a free ride.
Use generic ufs_reclaim().
|
29888 |
27-Sep-1997 |
kato |
Clustered read and write are switched at mount-option level.
1. Clustered I/O is switched by the MNT_NOCLUSTERR and MNT_NOCLUSTERW bits of the mnt_flag. The sysctl variables, vfs.foo.doclusterread and vfs.foo.doclusterwrite are deleted. Only mount option can control clustered I/O from userland. 2. When foofs_mount mounts block device, foofs_mount checks D_NOCLUSTERR and D_NOCLUSTERW bits of the d_flags member in the block device switch table. If D_NOCLUSTERR / D_NOCLUSTERW are set, MNT_NOCLUSTERR / MNT_NOCLUSTERW bits will be set. In this case, MNT_NOCLUSTERR and MNT_NOCLUSTERW cannot be cleared from userland. 3. Vnode driver disables both clustered read and write. 4. Union filesystem disables clustered write.
Reviewed by: bde
|
29609 |
19-Sep-1997 |
phk |
[Regarding the previous patch] This is completely wrong.
1. ffs_alloc() actually allowed writing one block less one frag (normally 7 frags or 7/8 blocks) beyond the limit. 2. freebufspace() gives the free space in frags, but `size' is in bytes, so the change results in approximately `size' fragments too many being reserved. 3. ffs_realloccg() has the same bug but wasn't changed.
PR: 3398 Submitted by: bde Eyeballed by: phk
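Point 2 above is a unit mismatch, which a small worked example may make concrete. The Python sketch below is not the kernel code: the geometry (8 KiB blocks, 1 KiB fragments) and the function names are assumptions chosen for illustration, and the real check also accounts for the minfree reserve.

```python
# Hypothetical file system geometry: 8 KiB blocks, 1 KiB fragments.
FS_BSIZE = 8192
FS_FSIZE = 1024

def numfrags(size_bytes):
    """Round a byte count up to a whole number of fragments."""
    return (size_bytes + FS_FSIZE - 1) // FS_FSIZE

def has_room_mixed_units(free_frags, size_bytes):
    """Mixed units: subtracts a byte count from a fragment count."""
    return free_frags - size_bytes >= 0

def has_room(free_frags, size_bytes):
    """Consistent units: convert the request to fragments first."""
    return free_frags - numfrags(size_bytes) >= 0

free_frags = 100      # free space, counted in fragments
size = FS_BSIZE       # one full block requested, counted in bytes

# Fragments reserved beyond what the request actually needs.
over_reserved = size - numfrags(size)
```

Here a one-block request needs 8 fragments, but the mixed-unit check treats it as 8192, so 8184 fragments too many are held back and the request is refused even though 100 fragments are free.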
|
29581 |
18-Sep-1997 |
phk |
Ffs_alloc allows users to write one block beyond the limit.
PR: 3398 Reviewed by: phk Submitted by: Wolfram Schneider <wosch@apfel.de>
|
29362 |
14-Sep-1997 |
peter |
Convert select -> poll. Delete 'always succeed' select/poll handlers, replaced with generic call. Flag missing vnode op table entries.
|
29208 |
07-Sep-1997 |
bde |
Removed yet more vestiges of config-time swap configuration and/or cleaned up nearby cruft.
|
29041 |
02-Sep-1997 |
bde |
Removed unused #includes.
|
28787 |
26-Aug-1997 |
phk |
Uncut&paste cache_lookup().
This unifies several copies of what were, in theory, identical 50 lines of code.
The filesystems have a new method: vop_cachedlookup, which is the meat of the lookup, and use vfs_cache_lookup() for their vop_lookup method. vfs_cache_lookup() will check the namecache and pass on to the vop_cachedlookup method in case of a miss.
It's still the task of the individual filesystems to populate the namecache with cache_enter().
Filesystems that do not use the namecache will just provide the vop_lookup method as usual.
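The split between vfs_cache_lookup() and vop_cachedlookup described above can be sketched as a toy model. The Python below is an illustration under simplifying assumptions, not the kernel interface: `ExampleFS` and its fields are hypothetical, and negative cache entries, locking and vnode reference counting are ignored.

```python
class ExampleFS:
    """Hypothetical filesystem that uses the name cache."""

    def __init__(self, entries):
        self.entries = entries      # stands in for on-disk directories
        self.namecache = {}         # (directory, name) -> vnode
        self.slow_lookups = 0       # counts real directory scans

    def cache_enter(self, dirname, name, vnode):
        # Populating the cache stays the filesystem's job.
        self.namecache[(dirname, name)] = vnode

    def vop_cachedlookup(self, dirname, name):
        # The "meat" of the lookup: only reached on a cache miss.
        self.slow_lookups += 1
        vnode = self.entries.get((dirname, name))
        if vnode is not None:
            self.cache_enter(dirname, name, vnode)
        return vnode


def vfs_cache_lookup(fs, dirname, name):
    """Generic front end: check the name cache, else call the filesystem."""
    hit = fs.namecache.get((dirname, name))
    if hit is not None:
        return hit
    return fs.vop_cachedlookup(dirname, name)
```

Looking up the same name twice through `vfs_cache_lookup` reaches `vop_cachedlookup` only once; the second call is answered from the cache.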
|
28701 |
25-Aug-1997 |
kato |
Renamed doclusterread/write to unique names (ffs_doclusterread/write), and staticize them. Move the #include of <sys/sysctl.h> to the top of the file.
Pointed out by: Bruce Evans <bde@zeta.org.au>
|
28270 |
16-Aug-1997 |
wollman |
Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to now malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
|
27890 |
04-Aug-1997 |
phk |
We got a couple of "map mismatch" panics from the following code. According to the crash dump, bpref is set to 445 and cgp->cg_nclusterblks is 444. Hence in the for loop, the test fails immediately but the following failure check (got == cgp->cg_nclusterblks) doesn't trigger because got > cgp->cg_nclusterblks. This wreaks havoc in the code after that.
Fix: Move one source bit to the left :-)
Noticed by: Mike Hibler <mike@fast.cs.utah.edu> Submitted by: Kirk McKusick <mckusick@McKusick.COM>
|
27845 |
02-Aug-1997 |
bde |
Removed unused #includes.
|
24775 |
10-Apr-1997 |
bde |
Use smalllblktosize() instead of multiplying small block numbers by fs->fs_bsize. The macro is usually faster and makes it clearer that the multiplication can't overflow.
|
24203 |
24-Mar-1997 |
bde |
Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include it when it is not used. In most cases, the reasons for including it went away when the special ioctl headers became self-sufficient.
|
24171 |
24-Mar-1997 |
bde |
Fixed corrupted newline and corrupted tab in previous commit.
|
24149 |
23-Mar-1997 |
guido |
Add generation number randomization. Newly created filesystems will now automatically have random generation numbers. The kernel's way of handling those also changed. Further, it is advised to run fsirand on all your NFS-exported filesystems. The code is mostly copied from OpenBSD, with the randomization changed to use /dev/urandom.
Reviewed by: Garrett Obtained from: OpenBSD
|
24131 |
23-Mar-1997 |
bde |
Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.
|
24101 |
22-Mar-1997 |
bde |
Fixed some invalid (non-atomic) accesses to `time', mostly ones of the form `tv = time'. Use a new function gettime(). The current version just forces atomicity without fixing precision or efficiency bugs. Simplified some related valid accesses by using the central function.
|
23997 |
18-Mar-1997 |
peter |
Restore the lost MNT_LOCAL flag twiddle. Lite2 has a different mechanism of setting it (compiled into vfs_conf.c), but we have a dynamic system in place. This could probably be better done via a runtime configure flag in the VFS_SET() VFS declaration, perhaps VFCF_LOCAL, and have the VFS code propagate this down into MNT_LOCAL at mount time. The other FS's would need to be updated; having UFS and MSDOSFS filesystems without MNT_LOCAL breaks a few things. The man page rebuild scans for local filesystems and currently fails, and I suspect that other tools like find and tar with their "local filesystem only" modes might be affected.
|
23908 |
15-Mar-1997 |
sos |
Fix support for != 512 byte sector devices. Restores the use of SBLOCK instead of the BSOFF/sectorsize calculation. Using SBLOCK is bogus, however, in that it uses DEV_BSIZE instead of the actual sector size, but that is taken care of in other places. Changing SBLOCK would be better, but it affects the system in other places, and doing it this way makes it possible to use filesystems that were made before the Lite2 merge.
|
23560 |
09-Mar-1997 |
mpp |
Update a number of panic messages to reflect the actual name of the routine that caused the panic.
|
23383 |
04-Mar-1997 |
bde |
Fixed connection of vfs.ffs node to the sysctl tree.
|
22975 |
22-Feb-1997 |
peter |
Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
|
22579 |
12-Feb-1997 |
mpp |
Add function prototypes for most of the new Lite2 functions. Also made a few of the miscfs routines static to be consistent. Some modules simply required some additional #includes to remove -Wall warnings.
|
22544 |
10-Feb-1997 |
mpp |
Correct the new Lite2 #ifdef DIAGNOSTIC ffs_checkblk routine to not return without setting a return value when it gets a block read error or detects a bad cylinder group, since the caller is expecting a return value. It will now panic at this point, since the alternative would be to return a "bad block" status to the caller, and the caller will panic anyway when that happens.
Also updated the panic strings in this routine to read "ffs_checkblk: ..." instead of "checkblk: ...".
|
22539 |
10-Feb-1997 |
mpp |
Make ffs_subr.c compile when DIAGNOSTIC is defined. It looks like this was broken before the Lite2 merge :-(. VOP_BMAP was being called with the wrong number of arguments.
|
22521 |
10-Feb-1997 |
dyson |
This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed.
Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
|
21673 |
14-Jan-1997 |
jkh |
Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
|
19700 |
13-Nov-1996 |
julian |
Submitted by: Archie and me.
We encountered an interesting situation where the superblock for a file system got written to disk with the "fs_fmod" flag set to one. It appears that this flag is normally supposed to be cleared during ffs_sync(), but we experienced a crash, or some other weird occurrence that left it on the disk set to 1.
Later this partition was mounted read-only... and the fs_fmod field was never cleared, causing ffs_sync() to panic "rofs mod" when trying to unmount that filesystem (ffs_vfsops.c: line 790).
fix: set this bit to 0 when you load the superblock from disk. (see more complete mail on this to hackers)
|
19424 |
05-Nov-1996 |
dg |
Eliminate an unnecessary synchronous write (and an 8K bcopy+bzero) when truncating/deleting large files.
Reviewed by: mckusick, dyson Submitted by: Kirk McKusick <mckusick@mckusick.com>, modified for FreeBSD by me.
|
18899 |
12-Oct-1996 |
bde |
Fixed lblktosize(). It overflowed at 2G. This bug only affected ufs_read() and ufs_write().
Found by: looking at warnings for comparing the result of lblktosize() (which is usually daddr_t = long) with file sizes (which are u_quad_t for ufs). File sizes should probably be off_t's to avoid warnings when they are compared with file offsets, so the fixed lblktosize() casts to off_t instead of u_quad_t.
Added definition of smalllblktosize(). It is the same as the old lblktosize() and is more efficient for small block numbers on 32-bit machines.
Use smalllblktosize() instead of its expansion in blksize() and dblksize(). This keeps the line length short and makes it more obvious that the shift can't overflow.
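The overflow at 2G can be illustrated with a small user-space model. This is only a sketch: the `_model` functions are hypothetical stand-ins for the macros, int64_t stands in for off_t, and the 8K block size (shift of 13) used in the note below is an assumed, typical value rather than something stated in the log.

```c
#include <stdint.h>

/*
 * Model of the fixed macro: widen the block number first, then shift,
 * so the byte offset is computed in 64 bits. blk must be non-negative.
 */
static int64_t
lblktosize_model(int32_t blk, int bshift)
{
	return ((int64_t)blk << bshift);
}

/* True if the byte offset of block blk still fits in a signed 32-bit value. */
static int
fits_in_32_bits_model(int32_t blk, int bshift)
{
	return (lblktosize_model(blk, bshift) <= INT32_MAX);
}
```

With a shift of 13, block 262143 is the last one whose byte offset fits in 32 signed bits; block 262144 starts at byte 2147483648 (2G), which is where a shift done in a 32-bit long goes wrong. Small block numbers never reach that point, which is why the narrower form remains usable for them.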
|
18397 |
19-Sep-1996 |
nate |
In sys/time.h, struct timespec is defined as:
/*
 * Structure defined by POSIX.4 to be like a timeval.
 */
struct timespec {
	time_t	ts_sec;		/* seconds */
	long	ts_nsec;	/* and nanoseconds */
};
The correct names of the fields are tv_sec and tv_nsec.
Reminded by: James Drobina <jdrobina@infinet.com>
|
18330 |
17-Sep-1996 |
peter |
Argh, I have had one "uid 0 on /: file system full" too many. The problem is that it doesn't say _what_ did it! (the core dumped console message is very useful for listing the process name and pid). This adds similar information.
|
18104 |
07-Sep-1996 |
dyson |
Fix a VOP_UNLOCK panic when using options DIAGNOSTIC during dismount.
|
17761 |
21-Aug-1996 |
dyson |
Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistent and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices are simpler and more correct.
This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc.
Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility.
Reviewed by: davidg
|
17108 |
12-Jul-1996 |
bde |
Don't use NULL in non-pointer contexts.
|
16312 |
12-Jun-1996 |
dg |
Moved the fsnode MALLOC to before the call to getnewvnode() so that the process won't possibly block before filling in the fsnode pointer (v_data) which might be dereferenced during a sync since the vnode is put on the mnt_vnodelist by getnewvnode.
Pointed out by Matt Day <mday@artisoft.com>
|
15680 |
08-May-1996 |
gpalmer |
Clean up various compiler warnings. Most (if not all) were benign
Reviewed by: bde
|
15493 |
01-May-1996 |
bde |
Removed bogus _BEGIN_DECLS/_END_DECLS.
Removed unused struct tag declarations in cloned code.
Added or cleaned up idempotency ifdefs.
|
14345 |
02-Mar-1996 |
dyson |
Handle the bogus device that MFS uses as its VBLK device. We now don't try to VMIO open it on MFS mounts. This will fix the mfs_badops panic.
|
14317 |
02-Mar-1996 |
dyson |
Enable VMIO for non-VDIR metadata and block device.
|
14249 |
25-Feb-1996 |
bde |
Removed vestigial support for the obsolete FIFO option. In ext2fs it caused null pointer panics for all fifo operations unless FIFO was defined.
|
13765 |
30-Jan-1996 |
mpp |
Fix a bunch of spelling errors in the comment fields of a bunch of system include files.
|
13490 |
19-Jan-1996 |
dyson |
Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do a lot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30 seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
|
13424 |
14-Jan-1996 |
bde |
Partially fixed negative and truncated "Avail" counts in df output. This fixes PR943.
ffs/ffs_vfsops.c: ffs_statfs() multiplied by (100 - minfree) as part of calculating the minfree percentage (complemented in 100%), so with the standard minfree of 8, it was broken for file systems of size >= 1TB/92 = 11GB. Use the standard freespace() macro instead. This also fixes a rounding bug (the "Avail" count was sometimes 1 too small).
ffs/* (not fixed): The freespace() macro multiplies by minfree, so with the standard minfree of 8, it is broken for file systems of size >= 1TB/8 = 128GB. This bug is more serious since it affects block allocation.
ffs/ffs_alloc.c (not fixed): Ordinary users are sometimes allowed to allocate 1 (partial) block too many so that the "Avail" count goes negative. E.g., if there is 1 fragment available and the file is fairly large, one more full block is allocated.
df/df.c: ufs_df() used/uses essentially the same code as ffs_statfs(), so it had/has the same bugs.
ufs_df() gratuitously replaced "Avail" counts of < 0 by 0, so it gave different results for non-mounted file systems in this case.
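The size limits quoted in this entry follow from signed 32-bit multiplication. The sketch below reproduces the arithmetic; the `_model` function names are hypothetical, and the 512-byte unit is an inference (it is the unit size for which 2^31 units equal the 1TB the log divides by), not something the log states.

```c
#include <stdint.h>

/* Largest count n for which n * factor still fits in a signed 32-bit int. */
static int32_t
max_safe_count_model(int32_t factor)
{
	return (INT32_MAX / factor);
}

/* The same limit expressed in bytes, for a given unit size. */
static int64_t
max_safe_bytes_model(int32_t factor, int32_t unit_bytes)
{
	return ((int64_t)max_safe_count_model(factor) * unit_bytes);
}
```

With a factor of 92 (100 - minfree, for minfree of 8) and 512-byte units the limit is a little over 11 GiB; with a factor of 8 it is just under 128 GiB, in line with the 1TB/92 and 1TB/8 figures above.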
|
13260 |
05-Jan-1996 |
wollman |
Convert QUOTA to new-style option.
|
13228 |
04-Jan-1996 |
wollman |
Convert DDB to new-style option.
|
12911 |
17-Dec-1995 |
phk |
Staticize.
|
12861 |
15-Dec-1995 |
peter |
Silence a harmless warning...
|
12767 |
11-Dec-1995 |
dyson |
Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
|
12662 |
07-Dec-1995 |
dg |
Untangled the vm.h include file spaghetti.
|
12590 |
03-Dec-1995 |
bde |
Completed function declarations and/or added prototypes and/or #includes to get the prototypes.
|
12424 |
20-Nov-1995 |
phk |
Fix compiler warnings.
|
12405 |
19-Nov-1995 |
dyson |
General fixes to the vfs clustering code:
1) Make cluster buffer list be a non-malloced chain. This eliminates yet another 'evil' M_WAITOK and generally cleans up the code.
2) Fix write clustering for ext2fs. It was just broken. Also, ffs clustering had an efficiency problem in that more bawrites were happening than should have been.
3) Make changes to buf.h to support the above, plus remove b_pfcent at the request of David Greenman.
Note that the reallocblocks code is disabled pending rewrite for the cluster buffer list changes.
|
12288 |
14-Nov-1995 |
phk |
Get rid of the last debug sysctl variables of the old style.
|
12158 |
09-Nov-1995 |
bde |
Introduced a type `vop_t' for vnode operation functions and used it 1138 times (:-() in casts and a few more times in declarations. This change is null for the i386.
The type has to be `typedef int vop_t(void *)' and not `typedef int vop_t()' because `gcc -Wstrict-prototypes' warns about the latter. Since vnode op functions are called with args of different (struct pointer) types, neither of these function types is any use for type checking of the arg, so it would be preferable not to use the complete function type, especially since using the complete type requires adding 1138 casts to avoid compiler warnings and another 40+ casts to reverse the function pointer conversions before calling the functions.
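A minimal sketch of the casting pattern this entry complains about, in user-space C. Only the `vop_t` typedef comes from the log; the `_model` argument structs, the `toy_` functions and the table are hypothetical and far simpler than the kernel's vnode operation vectors.

```c
/* The typedef quoted in the log: one function type shared by every vnode op. */
typedef int vop_t(void *);

/* Hypothetical per-operation argument structs (not the kernel's). */
struct vop_open_args_model {
	int mode;
};
struct vop_close_args_model {
	int fflag;
};

/* Each op takes a pointer to its own argument struct. */
static int
toy_open(struct vop_open_args_model *ap)
{
	return (ap->mode + 1);
}

static int
toy_close(struct vop_close_args_model *ap)
{
	return (ap->fflag * 2);
}

/* Storing the functions in a common table needs one cast per entry. */
static vop_t *toy_vnodeops[] = {
	(vop_t *)toy_open,
	(vop_t *)toy_close,
};

/* Before calling, convert the pointer back to the function's real type. */
static int
call_open_model(struct vop_open_args_model *ap)
{
	return (((int (*)(struct vop_open_args_model *))toy_vnodeops[0])(ap));
}

static int
call_close_model(struct vop_close_args_model *ap)
{
	return (((int (*)(struct vop_close_args_model *))toy_vnodeops[1])(ap));
}
```

The casts in the table correspond to the ones the log counts when storing functions, and the casts in the `call_*_model` helpers to the ones needed to reverse the conversion before a call. Neither gives the compiler any way to check that the argument struct matches the function, which is the point the entry makes.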
|
12111 |
05-Nov-1995 |
dyson |
Make MNT_ASYNC more effective for UFS. It should not be too much more dangerous than the original MNT_ASYNC. There might be some minor security considerations due to data writes not being posted as promptly as before. Meta-data operations are still not quite as fast as Linux, but streaming I/O is still higher.
|
11701 |
23-Oct-1995 |
dyson |
Finalize GETPAGES layering scheme. Move the device GETPAGES interface into specfs code. No need at this point to modify the PUTPAGES stuff except in the layered-type (NULL/UNION) filesystems.
|
10998 |
25-Sep-1995 |
dyson |
Re-enable read clustering.
|
10949 |
22-Sep-1995 |
dg |
Shit! I changed the wrong doclusterread! ...Thanks to Steven Wallace and Poul-Henning for convincing me that I should look at my mistake! :-)
|
10946 |
22-Sep-1995 |
dg |
Disable file read clustering until the bug(s) in vfs_cluster.c are fixed. This should temporarily fix the sig 10/11 problems that people have been having for the past 3 weeks.
|
10632 |
08-Sep-1995 |
dg |
Slight optimization for the standard case of rotdelay=0.
|
10578 |
06-Sep-1995 |
dyson |
Added indirect pointer for ffs_getpages, and added external declaration.
|
10551 |
04-Sep-1995 |
dyson |
Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count for VOP_BMAP. Updated affected filesystems...
|
10358 |
28-Aug-1995 |
julian |
Reviewed by: julian with quick glances by bruce and others Submitted by: terry (terry lambert)
This is a composite of 3 patch sets submitted by terry. They are:
- New low-level init code that supports loadable modules better.
- Some cleanups in the namei code to help terry in 16-bit character support.
- Some changes to the mount-root code to make it a little more modular.
NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able to test those cases.
Mounting root off disk certainly still works just fine. mfs should work but is untested (tomorrow's task).
The low level init stuff includes a total rewrite of init_main.c to make it possible for new modules to have an init phase by simply adding an entry to a TEXT_SET (or is it DATA_SET) list. Thus a new module can be added to the kernel without editing any files other than the 'files' file.
|
10269 |
25-Aug-1995 |
bde |
Don't call VOP_UPDATE() with volatile timestamps.
|
10078 |
16-Aug-1995 |
dg |
Honor -async mount option when doing the inode update.
Obtained from: 4.4BSD-Lite2
|
10027 |
11-Aug-1995 |
dg |
Converted mountlist to a CIRCLEQ.
Partially obtained from: 4.4BSD-Lite2
|
9980 |
07-Aug-1995 |
dg |
Use bdwrite() rather than brelse(). The cylinder group bitmap modification is not preserved otherwise. Note that this is a no-op in FreeBSD, however, as we have doreallocblks disabled.
Submitted by: Kirk McKusick
|
9967 |
06-Aug-1995 |
dg |
Removed redundant call to vm_object_page_clean - this is already done in vfs_msync().
|
9886 |
04-Aug-1995 |
dg |
Use the correct flags (IO_SYNC -> B_SYNC) when deciding to do a sync or async write in the section that changes the filesize. The bug resulted in the updates always being async.
Obtained from: 4.4BSD-Lite2
|
9618 |
21-Jul-1995 |
dg |
Since ufs_ihashget can block, the lock must be checked for each time the function returns. Also, moved lock into .bss and made minor cosmetic changes.
Submitted by: Bruce Evans
|
9601 |
21-Jul-1995 |
dg |
Implement a lock in ffs_vget to prevent a race condition where two processes try to allocate the same inode/vnode, causing a duplicate.
Submitted by: Matt Dillon, slightly reworked by me.
|
9507 |
13-Jul-1995 |
dg |
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!!
Much needed overhaul of the VM system. Included in this first round of changes:
1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers".
2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has essentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items.
3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed.
4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug.
5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance.
6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain.
7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance.
8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed.
9) Some almost useless debugging code removed.
10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure essentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology.
11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended.
12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course).
13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non-standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE.
14) (dyson) Sharing maps have been removed. Its marginal usefulness in a threads design can be worked around in other ways. Both #13 and #14 were done to simplify the code and improve readability and maintainability. (As were most all of these changes)
TODO:
1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size.
2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness.
3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind.
4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems.
5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
|
9356 |
28-Jun-1995 |
dg |
1) Converted v_vmdata to v_object. 2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs after vnode_pager_alloc() calls - the object is already guaranteed to be persistent. 3) Removed some gratuitous casts.
|
8876 |
30-May-1995 |
rgrimes |
Remove trailing whitespace.
|
8805 |
28-May-1995 |
dg |
Kill bogus vnode_pager_setsize(). It was being called at the wrong time and resulted in the object size being too small. This caused bad things to happen later when the file was mapped.
Reviewed by: John Dyson
|
8692 |
21-May-1995 |
dg |
Changes to fix the following bugs:
1) Files weren't properly synced on filesystems other than UFS. In some cases, this led to lost data. Most likely would be noticed on NFS. The fix is to make the VM page sync/object_clean general rather than in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused chunks of files to end up as zeroes rather than the intended contents. The fix was to fix several race conditions and to kludge up the "b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention to page modifications that occurred via the mmapping.
Reviewed by: David Greenman Submitted by: John Dyson
|
8624 |
19-May-1995 |
dg |
NFS diskless operation was broken because swapdev_vp wasn't initialized. These changes solve the problem in a general way by moving the initialization out of the individual fs_mountroot's and into swaponvp().
Submitted by: Poul-Henning Kamp
|
8530 |
15-May-1995 |
dg |
Fixed incompleteness that would allow dirty filesystems to get mounted when the single user shell was terminated. These changes disallow mounting or R/W upgrading filesystems that are dirty unless "-f" (force) option is used with mount. /etc/rc has been modified to abort the startup if one or more non-nfs partitions fail to mount.
Reviewed by: Poul-Henning Kamp, Rod Grimes
|
8456 |
11-May-1995 |
rgrimes |
Fix -Wformat warnings from LINT kernel.
|
8210 |
01-May-1995 |
dyson |
Limit filesize to the amount that the VM system can currently handle (2GB). If this limit is not imposed, then filesystem corruption will ensue when files larger than 2GB are created. This is temporary, and the underlying limitation will be removed later.
|
7752 |
11-Apr-1995 |
dg |
Handle the "syncing VCHR vnode hang" problem a little differently; just don't lock the vnode - it doesn't appear to ever be necessary for VCHR vnode/inodes. This fixes a bug introduced in the previous commit that caused tty timestamps to act strange (causing 'w' and 'finger' to show the tty wasn't idle when it may have been for hours).
|
7695 |
09-Apr-1995 |
dg |
Changes from John Dyson and myself:
Fixed remaining known bugs in the buffer IO and VM system.
vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR).
(various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf().
(various FS sync) Sync out modified vnode_pager backed pages.
ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks.
vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order.
vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previously not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES().
vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted.
vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
|
7430 |
28-Mar-1995 |
bde |
Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) that I didn't notice when I fixed "all" such warnings before.
|
7399 |
26-Mar-1995 |
dg |
Removed third arg (vmio) to allocbuf() that was added with the original merged cache changes, and figure it out based on the B_VMIO buffer flag. Fixes a problem where delayed write VMIO buffers would sometimes get recopied into kernel-alloced memory.
Submitted by: John Dyson
|
7170 |
19-Mar-1995 |
dg |
Removed redundant newlines that were in some panic strings.
|
7145 |
18-Mar-1995 |
dg |
Don't sync the inode date changes of character special devices during the FS sync. The system would appear to hang momentarily if there was a large backlog of I/O. This is because the vnode remains locked during the output - preventing normal character I/O. The problem was exacerbated by the FFS contiguous block allocation fixes and a semi-broken disksort(). The inode/date will still be synced during a normal FS dismount and whenever the inode is changed for other reasons.
|
7090 |
16-Mar-1995 |
bde |
Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
|
6994 |
10-Mar-1995 |
dg |
Increased default minfree to 8%.
|
6993 |
10-Mar-1995 |
dg |
The threshold for switching from time-space and space-time is too small when minfree is 5%...so make it stay at space in this case.
Submitted by: Kirk McKusick
|
6875 |
04-Mar-1995 |
dg |
Removed obsolete vtrace() remnants.
|
6864 |
03-Mar-1995 |
dg |
Fixes from John Dyson to work around vnode lock hang. Basically, remove the VOP_BMAP calls, and add one to bdwrite.
Submitted by: John Dyson
|
6769 |
27-Feb-1995 |
se |
Don't try to make use of useless rotational position optimisation if all free blocks are in the same bucket (i.e. NRPOS == 1). Otherwise a free block is chosen, possibly from a different cylinder, even if the block succeeding bpref was free ...
Submitted by: se
|
6357 |
14-Feb-1995 |
phk |
YF fix.
|
5455 |
09-Jan-1995 |
dg |
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme.
vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering.
vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption.
vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of kernel PTs.
vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore.
machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme.
machdep.c, kern_exec.c, vm_glue.c Implemented a separate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers.
Submitted by: John Dyson and David Greenman
|
5248 |
27-Dec-1994 |
bde |
Use the same current time throughout ffs_update().
Update some macro names in comments.
Don't use MNT_WAIT for something not related to mounting.
|
4463 |
14-Nov-1994 |
bde |
Undo a previous change. <sys/disklabel.h> was broken, not these files.
|
3962 |
28-Oct-1994 |
jkh |
From: fredriks@mcs.com (Lars Fredriksen) ... It turns out that these files do not include <sys/dkbad.h> before <sys/disklabel.h>.
Submitted by: fredriks
|
3768 |
22-Oct-1994 |
dg |
Restrict fs_maxfilesize to 2^40, and check against this in ffs_truncate(). This is part of a bug fix from Kirk McKusick to work around problems in FFS related to the blkno of a 64-bit offset not fitting into an int. Note that the proper solution would be to deal with 64-bit block numbers, but doing this would require sweeping changes; some other day perhaps.
Submitted by: Marshall Kirk McKusick
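The arithmetic behind the 2^40 limit can be sketched as follows. This is an illustration only, not the committed code: the 512-byte addressing unit, the 32-bit width of int, the error codes, and every `sketch_` name are assumptions that the commit message does not state.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values for illustration; not taken from the commit. */
#define SKETCH_MAXFILESIZE ((int64_t)1 << 40)  /* the 2^40 cap */
#define SKETCH_UNIT        512                 /* hypothetical addressing unit, in bytes */

/* Index of the unit containing byte `offset`, computed in 64 bits. */
static int64_t sketch_blkno(int64_t offset)
{
    return offset / SKETCH_UNIT;
}

/* Nonzero if that index can be stored in a 32-bit signed int without loss. */
static int sketch_fits_int32(int64_t offset)
{
    int64_t b = sketch_blkno(offset);
    return b >= 0 && b <= INT32_MAX;
}

/* Length check in the spirit of the one added to ffs_truncate(). */
static int sketch_check_length(int64_t length)
{
    if (length < 0)
        return EINVAL;            /* hypothetical handling of negative lengths */
    if (length > SKETCH_MAXFILESIZE)
        return EFBIG;             /* larger than the cap */
    return 0;
}
```

Under these assumptions, the last byte of a maximum-size file (offset 2^40 - 1) falls in unit 2^31 - 1, the largest value a 32-bit signed int can hold, while offset 2^40 itself would need unit 2^31 and overflow.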
|
3487 |
10-Oct-1994 |
phk |
Cosmetics. Make gcc less noisy. Still some way to go here.
|
3425 |
08-Oct-1994 |
phk |
Cosmetics for gcc -Wall. A couple of unused "int i"'s removed and a couple of prototypes added. And the usual () work.
|
3396 |
06-Oct-1994 |
dg |
Use tsleep() rather than sleep() so that 'ps' is more informative about the wait.
|
2979 |
22-Sep-1994 |
wollman |
More loadable VFS changes:
- Make a number of filesystems work again when they are statically compiled (blush)
- FIFOs are no longer optional; ``options FIFO'' removed from distributed config files.
|
2967 |
22-Sep-1994 |
wollman |
Call ffs ``ufs'' for the benefit of poor, confused user-land programs.
|
2946 |
21-Sep-1994 |
wollman |
Implemented loadable VFS modules, and made most existing filesystems loadable. (NFS is a notable exception.)
|
2922 |
20-Sep-1994 |
bde |
Use `1' for a boolean value instead of something irrelevant (MNT_WAIT) that happens to be nonzero.
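A minimal sketch of the pattern being cleaned up, assuming a function whose argument is only tested for truth. The names `sketch_update` and `SKETCH_MNT_WAIT`, and the constant's value, are hypothetical stand-ins rather than the kernel's definitions.

```c
/* Hypothetical stand-in for the mount-related constant; the value is assumed. */
#define SKETCH_MNT_WAIT 1

/*
 * `waitfor` is treated purely as a boolean: any nonzero value requests a
 * synchronous update.  Returns 1 if the caller asked to wait, 0 otherwise.
 */
static int sketch_update(int waitfor)
{
    return waitfor ? 1 : 0;
}
```

Both `sketch_update(SKETCH_MNT_WAIT)` and `sketch_update(1)` behave identically here, which is why the old call worked; passing a plain `1` simply stops implying a connection to mounting that does not exist.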
|
2460 |
02-Sep-1994 |
dg |
panic if length is < 0 in ffs_truncate().
|
2384 |
29-Aug-1994 |
dg |
"bogus" fixes from 1.1.5 to work around some cache coherency problems.
|
2176 |
21-Aug-1994 |
paul |
Made idempotent.
|
2152 |
20-Aug-1994 |
dg |
Implemented filesystem clean bit via:
machdep.c: Changed printf's a little and call vfs_unmountall() if the sync was successful.
cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c: Allow dismount of root FS. It is now disallowed at a higher level.
vfs_conf.c: Removed unused rootfs global.
vfs_subr.c: Added new routines vfs_unmountall and vfs_unmountroot. Filesystems are now dismounted if the machine is properly rebooted.
ffs_vfsops.c: Toggle clean bit at the appropriate places. Print warning if an unclean FS is mounted.
ffs_vfsops.c, lfs_vfsops.c: Fix bug in selecting proper flags for VOP_CLOSE().
vfs_syscalls.c: Disallow dismounting root FS via umount syscall.
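The clean-bit lifecycle described above might be sketched like this. It is a simplified model under stated assumptions, not the ffs_vfsops.c code: `struct sketch_fs` and both functions are hypothetical, and the real handling of read-only mounts and of filesystems that were already unclean may differ.

```c
#include <stdbool.h>

/* Hypothetical in-memory view of the on-disk state; not the real superblock. */
struct sketch_fs {
    bool clean;     /* on-disk clean bit */
    bool readonly;  /* how the filesystem is currently mounted */
};

/*
 * Mount: report whether an "unclean filesystem" warning is due, then clear
 * the clean bit for read-write mounts so that a crash leaves it cleared.
 */
static bool sketch_mount(struct sketch_fs *fs, bool readonly)
{
    bool warn = !fs->clean;

    fs->readonly = readonly;
    if (!readonly)
        fs->clean = false;
    return warn;
}

/* Orderly unmount: a read-write filesystem gets its clean bit set again. */
static void sketch_unmount(struct sketch_fs *fs)
{
    if (!fs->readonly)
        fs->clean = true;
}
```

In this model, a mount followed by a proper unmount leaves the bit set, while a mount that is never unmounted (a crash) leaves it cleared, so the next mount reports the warning.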
|
2112 |
18-Aug-1994 |
wollman |
Fix up some sloppy coding practices:
- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.
NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
|
1960 |
08-Aug-1994 |
dg |
Made lockf advisory locking code generic (rather than ufs specific), and use it in NFS. This is required both for diskless support and for POSIX compliance. Note: the support in NFS is only for the local node.
Submitted by: based on work originally done by Yuval Yurom
|
1826 |
03-Aug-1994 |
dg |
Changed occurrences of "itrunc" to "ffs_truncate" to make Bruce happy.
|
1821 |
02-Aug-1994 |
dg |
Completed (hopefully) the kernel support for old style "fastlinks".
|
1817 |
02-Aug-1994 |
dg |
Added $Id$
|
1549 |
25-May-1994 |
rgrimes |
The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.
Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman
|
1542 |
24-May-1994 |
rgrimes |
This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
|