#
22650a99 |
|
26-Mar-2024 |
Christian Brauner <brauner@kernel.org> |
fs,block: yield devices early Currently a device is only really released once the umount returns to userspace due to how file closing works. That ultimately could cause an old umount assumption to be violated that concurrent umount and mount don't fail. So an exclusively held device with a temporary holder should be yielded before the filesystem is gone. Add a helper that allows callers to do that. This also allows us to remove the two holder ops that Linus wasn't excited about. Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
7fbaab57 |
|
22-Feb-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: port refcount repair to the new refcount bag structure Port the refcount record generating code to use the new refcount bag data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
86a1746e |
|
22-Feb-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: track directory entry updates during live nlinks fsck Create the necessary hooks in the directory operations (create/link/unlink/rename) code so that our live nlink scrub code can stay up to date with link count updates in the rest of the filesystem. This will be the means to keep our shadow link count information up to date while the scan runs in real time. In online fsck part 2, we'll use these same hooks to handle repairs to directories and parent pointer information. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
785dd131 |
|
19-Feb-2024 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
xfs: Remove mrlock wrapper mrlock was an rwsem wrapper that also recorded whether the lock was held for read or write. Now that we can ask the generic code whether the lock is held for read or write, we can remove this wrapper and use an rwsem directly. As the comment says, we can't use lockdep to assert that the ILOCK is held for write, because we might be in a workqueue, and we aren't able to tell lockdep that we do in fact own the lock. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
#
d4c75a1b |
|
15-Jan-2024 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert remaining kmem_free() to kfree() The remaining callers of kmem_free() are freeing heap memory, so we can convert them directly to kfree() and get rid of kmem_free() altogether. This conversion was done with: $ for f in `git grep -l kmem_free fs/xfs`; do > sed -i s/kmem_free/kfree/ $f > done $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
#
f078d4ea |
|
15-Jan-2024 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert kmem_alloc() to kmalloc() kmem_alloc() is just a thin wrapper around kmalloc() these days. Convert everything to use kmalloc() so we can get rid of the wrapper. Note: the transaction region allocation in xlog_add_to_transaction() can be a high order allocation. Converting it to use kmalloc(__GFP_NOFAIL) results in warnings in the page allocation code being triggered because the mm subsystem does not want us to use __GFP_NOFAIL with high order allocations like we've been doing with the kmem_alloc() wrapper for a couple of decades. Hence this specific case gets converted to xlog_kvmalloc() rather than kmalloc() to avoid this issue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
#
f88c3fb8 |
|
12-Mar-2024 |
Linus Torvalds <torvalds@linux-foundation.org> |
mm, slab: remove last vestiges of SLAB_MEM_SPREAD Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1b9e2d90 |
|
23-Jan-2024 |
Christian Brauner <brauner@kernel.org> |
xfs: port block device access to files Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-7-adbd023e19cc@kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
f3a60882 |
|
08-Feb-2024 |
Christian Brauner <brauner@kernel.org> |
bdev: open block device as files Add two new helpers to allow opening block devices as files. This is not the final infrastructure. This still opens the block device before opening a struct a file. Until we have removed all references to struct bdev_handle we can't switch the order: * Introduce blk_to_file_flags() to translate from block specific to flags usable to pen a new file. * Introduce bdev_file_open_by_{dev,path}(). * Introduce temporary sb_bdev_handle() helper to retrieve a struct bdev_handle from a block device file and update places that directly reference struct bdev_handle to rely on it. * Don't count block device openes against the number of open files. A bdev_file_open_by_{dev,path}() file is never installed into any file descriptor table. One idea that came to mind was to use kernel_tmpfile_open() which would require us to pass a path and it would then call do_dentry_open() going through the regular fops->open::blkdev_open() path. But then we're back to the problem of routing block specific flags such as BLK_OPEN_RESTRICT_WRITES through the open path and would have to waste FMODE_* flags every time we add a new one. With this we can avoid using a flag bit and we have more leeway in how we open block devices from bdev_open_by_{dev,path}(). Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
27c86d43 |
|
15-Sep-2023 |
Shiyang Ruan <ruansy.fnst@fujitsu.com> |
xfs: drop experimental warning for FSDAX FSDAX and reflink can work together now, let's drop this warning. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
#
d8d222e0 |
|
15-Jan-2024 |
Dave Chinner <dchinner@redhat.com> |
xfs: read only mounts with fsopen mount API are busted Recently xfs/513 started failing on my test machines testing "-o ro,norecovery" mount options. This was being emitted in dmesg: [ 9906.932724] XFS (pmem0): no-recovery mounts must be read-only. Turns out, readonly mounts with the fsopen()/fsconfig() mount API have been busted since day zero. It's only taken 5 years for debian unstable to start using this "new" mount API, and shortly after this I noticed xfs/513 had started to fail as per above. The syscall trace is: fsopen("xfs", FSOPEN_CLOEXEC) = 3 mount_setattr(-1, NULL, 0, NULL, 0) = -1 EINVAL (Invalid argument) ..... fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/pmem0", 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 fsconfig(3, FSCONFIG_SET_FLAG, "norecovery", NULL, 0) = 0 fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = -1 EINVAL (Invalid argument) close(3) = 0 Showing that the actual mount instantiation (FSCONFIG_CMD_CREATE) is what threw out the error. During mount instantiation, we call xfs_fs_validate_params() which does: /* No recovery flag requires a read-only mount */ if (xfs_has_norecovery(mp) && !xfs_is_readonly(mp)) { xfs_warn(mp, "no-recovery mounts must be read-only."); return -EINVAL; } and xfs_is_readonly() checks internal mount flags for read only state. This state is set in xfs_init_fs_context() from the context superblock flag state: /* * Copy binary VFS mount flags we are interested in. */ if (fc->sb_flags & SB_RDONLY) set_bit(XFS_OPSTATE_READONLY, &mp->m_opstate); With the old mount API, all of the VFS specific superblock flags had already been parsed and set before xfs_init_fs_context() is called, so this all works fine. However, in the brave new fsopen/fsconfig world, xfs_init_fs_context() is called from fsopen() context, before any VFS superblock have been set or parsed. Hence if we use fsopen(), the internal XFS readonly state is *never set*. Hence anything that depends on xfs_is_readonly() actually returning true for read only mounts is broken if fsopen() has been used to mount the filesystem. Fix this by moving this internal state initialisation to xfs_fs_fill_super() before we attempt to validate the parameters that have been set prior to the FSCONFIG_CMD_CREATE call being made. Signed-off-by: Dave Chinner <dchinner@redhat.com> Fixes: 73e5fff98b64 ("xfs: switch to use the new mount-api") cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
#
646ddf0c |
|
04-Dec-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: clean up the xfs_reserve_blocks interface xfs_reserve_blocks has a very odd interface that can only be explained by it directly deriving from the IRIX fcntl handler back in the day. Split reporting out the reserved blocks out of xfs_reserve_blocks into the only caller that cares. This means that the value reported from XFS_IOC_SET_RESBLKS isn't atomically sampled in the same critical section as when it was set anymore, but as the values could change right after setting them anyway that does not matter. It does provide atomic sampling of both values for XFS_IOC_GET_RESBLKS now, though. Also pass a normal scalar integer value for the requested value instead of the pointless pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
#
3584c8f4 |
|
01-Nov-2023 |
Jan Kara <jack@suse.cz> |
xfs: Block writes to log device Ask block layer to not allow other writers to open block devices used for xfs log and realtime devices. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231101174325.10596-6-jack@suse.cz Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
653bee38 |
|
24-Oct-2023 |
Christian Brauner <brauner@kernel.org> |
xfs: simplify device handling We removed all codepaths where s_umount is taken beneath open_mutex and bd_holder_lock so don't make things more complicated than they need to be and hold s_umount over block device opening. Link: https://lore.kernel.org/r/20231024-vfs-super-rework-v1-2-37a8aa697148@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
fa5a3872 |
|
16-Oct-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: create a helper to convert rtextents to rtblocks Create a helper to convert a realtime extent to a realtime block. Later on we'll change the helper to use bit shifts when possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
e340dd63 |
|
27-Sep-2023 |
Jan Kara <jack@suse.cz> |
xfs: Convert to bdev_open_by_path() Convert xfs to use bdev_open_by_path() and pass the handle around. CC: "Darrick J. Wong" <djwong@kernel.org> CC: linux-xfs@vger.kernel.org Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-28-jack@suse.cz Acked-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
ef7d9593 |
|
11-Sep-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: remove CPU hotplug infrastructure There are no users of the cpu hotplug hooks in xfs now, so remove it. This reverts f1653c2e2831e ("xfs: introduce CPU hotplug infrastructure"). Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
f5bfa695 |
|
11-Sep-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: remove the all-mounts list Revert commit 0ed17f01c8540 ("xfs: introduce all-mounts list for cpu hotplug notifications") because the cpu hotplug hooks are now pointless, so we don't need this list anymore. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
62334fab |
|
11-Sep-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use per-mount cpumask to track nonempty percpu inodegc lists Directly track which CPUs have contributed to the inodegc percpu lists instead of trusting the cpu online mask. This eliminates a theoretical problem where the inodegc flush functions might fail to flush a CPU's inodes if that CPU happened to be dying at exactly the same time. Most likely nobody's noticed this because the CPU dead hook moves the percpu inodegc list to another CPU and schedules that worker immediately. But it's quite possible that this is a subtle race leading to UAF if the inodegc flush were part of an unmount. Further benefits: This reduces the overhead of the inodegc flush code slightly by allowing us to ignore CPUs that have empty lists. Better yet, it reduces our dependence on the cpu online masks, which have been the cause of confusion and drama lately. Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
ecd49f7a |
|
11-Sep-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: fix per-cpu CIL structure aggregation racing with dying cpus In commit 7c8ade2121200 ("xfs: implement percpu cil space used calculation"), the XFS committed (log) item list code was converted to use per-cpu lists and space tracking to reduce cpu contention when multiple threads are modifying different parts of the filesystem and hence end up contending on the log structures during transaction commit. Each CPU tracks its own commit items and space usage, and these do not have to be merged into the main CIL until either someone wants to push the CIL items, or we run over a soft threshold and switch to slower (but more accurate) accounting with atomics. Unfortunately, the for_each_cpu iteration suffers from the same race with cpu dying problem that was identified in commit 8b57b11cca88f ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a result, both CIL percpu structure aggregation functions fail to collect the items and accounted space usage at the correct point in time. If we're lucky, the items that are collected from the online cpus exceed the space given to those cpus, and the log immediately shuts down in xlog_cil_insert_items due to the (apparent) log reservation overrun. This happens periodically with generic/650, which exercises cpu hotplug vs. the filesystem code: smpboot: CPU 3 is now offline XFS (sda3): ctx ticket reservation ran out. Need to up reservation XFS (sda3): ticket reservation summary: XFS (sda3): unit res = 9268 bytes XFS (sda3): current res = -40 bytes XFS (sda3): original count = 1 XFS (sda3): remaining count = 1 XFS (sda3): Filesystem has been shut down due to log error (0x2). Applying the same sort of fix from 8b57b11cca88f to the CIL code seems to make the generic/650 problem go away, but I've been told that tglx was not happy when he saw: "...the only thing we actually need to care about is that percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when there are no CPUs dying, it has no addition overhead except for a cpumask_or() operation." The CPU hotplug code is rather complex and difficult to understand and I don't want to try to understand the cpu hotplug locking well enough to use cpu_dying mask. Furthermore, there's a performance improvement that could be had here. Attach a private cpu mask to the CIL structure so that we can track exactly which cpus have accessed the percpu data at all. It doesn't matter if the cpu has since gone offline; log item aggregation will still find the items. Better yet, we skip cpus that have not recently logged anything. Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU hot remove racing with an xfs mount can crash if the cpu_dead notifier tries to access the log but the mount hasn't yet set up the log. Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/ Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/ Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/ Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/ Cc: tglx@linutronix.de Cc: peterz@infradead.org Reported-by: ritesh.list@gmail.com Reported-by: sandeen@sandeen.net Fixes: af1c2146a50b ("xfs: introduce per-cpu CIL tracking structure") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
f798accd |
|
20-Sep-2023 |
Christian Brauner <brauner@kernel.org> |
Revert "xfs: switch to multigrain timestamps" This reverts commit e44df2664746aed8b6dd5245eb711a0ce33c5cf5. Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away. Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
d7a74cad |
|
10-Aug-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: track usage statistics of online fsck Track the usage, outcomes, and run times of the online fsck code, and report these values via debugfs. The columns in the file are: * scrubber name * number of scrub invocations * clean objects found * corruptions found * optimizations found * cross referencing failures * inconsistencies found during cross referencing * incomplete scrubs * warnings * number of time scrub had to retry * cumulative amount of time spent scrubbing (microseconds) * number of repair inovcations * successfully repaired objects * cumuluative amount of time spent repairing (microseconds) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
a76dba3b |
|
10-Aug-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: create scaffolding for creating debugfs entries Set up debugfs directories for xfs as a whole, and a subdirectory for each mounted filesystem. This will enable the creation of debugfs files in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
8ffa54e3 |
|
02-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
xfs use fs_holder_ops for the log and RT devices Use the generic fs_holder_ops to shut down the file system when the log or RT device goes away instead of duplicating the logic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230802154131.2221419-13-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
8d945b59 |
|
02-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: drop s_umount over opening the log and RT devices Just like get_tree_bdev needs to drop s_umount when opening the main device, we need to do the same for the xfs log and RT devices to avoid a potential lock order reversal with s_unmount for the mark_dead path. It might be preferable to just drop s_umount over ->fill_super entirely, but that will require a fairly massive audit first, so we'll do the easy version here first. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230802154131.2221419-12-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
e44df266 |
|
07-Aug-2023 |
Jeff Layton <jlayton@kernel.org> |
xfs: switch to multigrain timestamps Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. Also, anytime the mtime changes, the ctime must also change, and those are now the only two options for xfs_trans_ichgtime. Have that function unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is always set. Acked-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230807-mgctime-v7-11-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
1a0a5dad |
|
09-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: document the invalidate_bdev call in invalidate_bdev Copy and paste the commit message from Darrick into a comment to explain the seemingly odd invalidate_bdev in xfs_shutdown_devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-8-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
35a93b14 |
|
09-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: close the external block devices in xfs_mount_free blkdev_put must not be called under sb->s_umount to avoid a lock order reversal with disk->open_mutex. Move closing the buftargs into ->kill_sb to archive that. Note that the flushing of the disk caches and block device mapping invalidated needs to stay in ->put_super as the main block device is closed in kill_block_super already. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-7-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
d3ef7e94 |
|
09-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: remove xfs_blkdev_put There isn't much use for this trivial wrapper, especially as the NULL check is only needed in a single call site. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-5-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
2a9311ad |
|
09-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: free the xfs_mount in ->kill_sb As a rule of thumb everything allocated to the fs_context and moved into the super_block should be freed by ->kill_sb so that the teardown handling doesn't need to be duplicated between the fill_super error path and put_super. Implement a XFS-specific kill_sb method to do that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-4-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
1aa2d074 |
|
09-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: remove a superfluous s_fs_info NULL check in xfs_fs_put_super ->put_super is only called when sb->s_root is set, and thus when fill_super succeeds. Thus drop the NULL check that can't happen in xfs_fs_put_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-3-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
dbbff489 |
|
09-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: reformat the xfs_fs_free prototype The xfs_fs_free prototype formatting is a weird mix of the classic XFS style and the Linux style. Fix it up to be consistent. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Message-Id: <20230809220545.1308228-2-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
61d7e827 |
|
12-Jun-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: drop EXPERIMENTAL tag for large extent counts This feature has been baking in upstream for ~10mo with no bug reports. It seems to work fine here, let's get rid of the scary warnings? Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
05bdb996 |
|
08-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
block: replace fmode_t with a block-specific type for block open flags The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
2736e8ee |
|
08-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
block: use the holder as indication for exclusive opens The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
8067ca1d |
|
01-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: wire up the ->mark_dead holder operation for log and RT devices Implement a set of holder_ops that shut down the file system when the block device used as log or RT device is removed undeneath the file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
e7caa877 |
|
01-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
xfs: wire up sops->shutdown Wire up the shutdown method to shut down the file system when the underlying block device is marked dead. Add a new message to clearly distinguish this shutdown reason from other shutdowns. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
0718afd4 |
|
01-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
block: introduce holder ops Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
d4d12c02 |
|
04-Jun-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: collect errors from inodegc for unlinked inode recovery Unlinked list recovery requires errors removing the inode the from the unlinked list get fed back to the main recovery loop. Now that we offload the unlinking to the inodegc work, we don't get errors being fed back when we trip over a corruption that prevents the inode from being removed from the unlinked list. This means we never clear the corrupt unlinked list bucket, resulting in runtime operations eventually tripping over it and shutting down. Fix this by collecting inodegc worker errors and feed them back to the flush caller. This is largely best effort - the only context that really cares is log recovery, and it only flushes a single inode at a time so we don't need complex synchronised handling. Essentially the inodegc workers will capture the first error that occurs and the next flush will gather them and clear them. The flush itself will only report the first gathered error. In the cases where callers can return errors, propagate the collected inodegc flush error up the error handling chain. In the case of inode unlinked list recovery, there are several superfluous calls to flush queued unlinked inodes - xlog_recover_iunlink_bucket() guarantees that it has flushed the inodegc and collected errors before it returns. Hence nothing in the calling path needs to run a flush, even when an error is returned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
b37c4c83 |
|
01-May-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: check that per-cpu inodegc workers actually run on that cpu Now that we've allegedly worked out the problem of the per-cpu inodegc workers being scheduled on the wrong cpu, let's put in a debugging knob to let us know if a worker ever gets mis-scheduled again. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
7ba83850 |
|
11-Apr-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: deprecate the ascii-ci feature This feature is a mess -- the hash function has been broken for the entire 15 years of its existence if you create names with extended ascii bytes; metadump name obfuscation has silently failed for just as long; and the feature clashes horribly with the UTF8 encodings that most systems use today. There is exactly one fstest for this feature. In other words, this feature is crap. Let's deprecate it now so we can remove it from the codebase in 2030. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
3cfb9290 |
|
16-Mar-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: test dir/attr hash when loading module Back in the 6.2-rc1 days, Eric Whitney reported a fstests regression in ext4 against generic/454. The cause of this test failure was the unfortunate combination of setting an xattr name containing UTF8 encoded emoji, an xattr hash function that accepted a char pointer with no explicit signedness, signed type extension of those chars to an int, and the 6.2 build tools maintainers deciding to mandate -funsigned-char across the board. As a result, the ondisk extended attribute structure written out by 6.1 and 6.2 were not the same. This discrepancy, in fact, had been noticeable if a filesystem with such an xattr were moved between any two architectures that don't employ the same signedness of a raw "char" declaration. The only reason anyone noticed is that x86 gcc defaults to signed, and no such -funsigned-char update was made to e2fsprogs, so e2fsck immediately started reporting data corruption. After a day and a half of discussing how to handle this use case (xattrs with bit 7 set anywhere in the name) without breaking existing users, Linus merged his own patch and didn't tell the maintainer. None of the ext4 developers realized this until AUTOSEL announced that the commit had been backported to stable. In the end, this problem could have been detected much earlier if there had been any useful tests of hash function(s) in use inside ext4 to make sure that they always produce the same outputs given the same inputs. The XFS dirent/xattr name hash takes a uint8_t*, so I don't think it's vulnerable to this problem. However, let's avoid all this drama by adding our own self test to check that the da hash produces the same outputs for a static pile of inputs on various platforms. This enables us to fix any breakage that may result in a controlled fashion. The buffer and test data are identical to the patches submitted to xfsprogs. Link: https://lore.kernel.org/linux-ext4/Y8bpkm3jA3bDm3eL@debian-BULLSEYE-live-builder-AMD64/ Link: https://lore.kernel.org/linux-xfs/ZBUKCRR7xvIqPrpX@destitution/T/#md38272cc684e2c0d61494435ccbb91f022e8dee4 Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
7ac2ff8b |
|
12-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: perags need atomic operational state We currently don't have any flags or operational state in the xfs_perag except for the pagf_init and pagi_init flags. And the agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok flags, too. For controlling per-ag operations, we are going to need some atomic state flags. Hence add an opstate field similar to what we already have in the mount and log, and convert all these state flags across to atomic bit operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
20a5eab4 |
|
12-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert xfs_ialloc_next_ag() to an atomic This is currently a spinlock lock protected rotor which can be implemented with a single atomic operation. Change it to be more efficient and get rid of the m_agirotor_lock. Noticed while converting the inode allocation AG selection loop to active perag references. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
64c80dfd |
|
16-Nov-2022 |
Lukas Herbolt <lukas@herbolt.com> |
xfs: Print XFS UUID on mount and umount events. As of now only device names are printed out over __xfs_printk(). The device names are not persistent across reboots which in case of searching for origin of corruption brings another task to properly identify the devices. This patch add XFS UUID upon every mount/umount event which will make the identification much easier. Signed-off-by: Lukas Herbolt <lukas@herbolt.com> [sandeen: rebase onto current upstream kernel] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
3c5aaace |
|
21-Oct-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: refactor all the EFI/EFD log item sizeof logic Refactor all the open-coded sizeof logic for EFI/EFD log item and log format structures into common helper functions whose names reflect the struct names. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
03a7485c |
|
20-Oct-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: fix memcpy fortify errors in EFI log format copying Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of memcpy. Since we're already fixing problems with BUI item copying, we should fix it everything else. An extra difficulty here is that the ef[id]_extents arrays are declared as single-element arrays. This is not the convention for flex arrays in the modern kernel, and it causes all manner of problems with static checking tools, since they often cannot tell the difference between a single element array and a flex array. So for starters, change those array[1] declarations to array[] declarations to signal that they are proper flex arrays and adjust all the "size-1" expressions to fit the new declaration style. Next, refactor the xfs_efi_copy_format function to handle the copying of the head and the flex array members separately. While we're at it, fix a minor validation deficiency in the recovery function. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
cbfecb92 |
|
24-Aug-2022 |
Lukas Czerner <lczerner@redhat.com> |
fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Currently the I_DIRTY_TIME will never get set if the inode already has I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's true, however ext4 will only update the on-disk inode in ->dirty_inode(), not on actual writeback. As a result if the inode already has I_DIRTY_INODE state by the time we get to __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled into on-disk inode and will not get updated until the next I_DIRTY_INODE update, which might never come if we crash or get a power failure. The problem can be reproduced on ext4 by running xfstest generic/622 with -o iversion mount option. Fix it by allowing I_DIRTY_TIME to be set even if the inode already has I_DIRTY_INODE. Also make sure that the case is properly handled in writeback_single_inode() as well. Additionally changes in xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag. Thanks Jan Kara for suggestions on how to make this work properly. Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
#
35fcd75a |
|
09-Jun-2022 |
Shiyang Ruan <ruansy.fnst@fujitsu.com> |
xfs: fail dax mount if reflink is enabled on a partition Failure notification is not supported on partitions. So, when we mount a reflink enabled xfs on a partition with dax option, let it fail with -EINVAL code. Link: https://lkml.kernel.org/r/20220609143435.393724-1-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
231f91ab |
|
18-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_buf cache destroy isn't RCU safe Darrick and Sachin Sant reported that xfs/435 and xfs/436 would report an non-empty xfs_buf slab on module remove. This isn't easily to reproduce, but is clearly a side effect of converting the buffer caceh to RUC freeing and lockless lookups. Sachin bisected and Darrick hit it when testing the patchset directly. Turns out that the xfs_buf slab is not destroyed when all the other XFS slab caches are destroyed. Instead, it's got it's own little wrapper function that gets called separately, and so it doesn't have an rcu_barrier() call in it that is needed to drain all the rcu callbacks before the slab is destroyed. Fix it by removing the xfs_buf_init/terminate wrappers that just allocate and destroy the xfs_buf slab, and move them to the same place that all the other slab caches are set up and destroyed. Reported-and-tested-by: Sachin Sant <sachinp@linux.ibm.com> Fixes: 298f34224506 ("xfs: lockless buffer lookup") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
784eb7d8 |
|
13-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: add in-memory iunlink log item Now that we have a clean operation to update the di_next_unlinked field of inode cluster buffers, we can easily defer this operation to transaction commit time so we can order the inode cluster buffer locking consistently. To do this, we introduce a new in-memory log item to track the unlinked list item modification that we are going to make. This follows the same observations as the in-memory double linked list used to track unlinked inodes in that the inodes on the list are pinned in memory and cannot go away, and hence we can simply reference them for the duration of the transaction without needing to take active references or pin them or look them up. This allows us to pass the xfs_inode to the transaction commit code along with the modification to be made, and then order the logged modifications via the ->iop_sort and ->iop_precommit operations for the new log item type. As this is an in-memory log item, it doesn't have formatting, CIL or AIL operational hooks - it exists purely to run the inode unlink modifications and is then removed from the transaction item list and freed once the precommit operation has run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
af1c2146 |
|
01-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce per-cpu CIL tracking structure The CIL push lock is highly contended on larger machines, becoming a hard bottleneck that about 700,000 transaction commits/s on >16p machines. To address this, start moving the CIL tracking infrastructure to utilise per-CPU structures. We need to track the space used, the amount of log reservation space reserved to write the CIL, the log items in the CIL and the busy extents that need to be completed by the CIL commit. This requires a couple of per-cpu counters, an unordered per-cpu list and a globally ordered per-cpu list. Create a per-cpu structure to hold these and all the management interfaces needed, as well as the hooks to handle hotplug CPUs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
5e672cd6 |
|
16-Jun-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce xfs_inodegc_push() The current blocking mechanism for pushing the inodegc queue out to disk can result in systems becoming unusable when there is a long running inodegc operation. This is because the statfs() implementation currently issues a blocking flush of the inodegc queue and a significant number of common system utilities will call statfs() to discover something about the underlying filesystem. This can result in userspace operations getting stuck on inodegc progress, and when trying to remove a heavily reflinked file on slow storage with a full journal, this can result in delays measuring in hours. Avoid this problem by adding "push" function that expedites the flushing of the inodegc queue, but doesn't wait for it to complete. Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this mechanism so they don't block but still ensure that queued operations are expedited. Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") Reported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: fix _getquota_next to use _inodegc_push too] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
7cf2b0f9 |
|
16-Jun-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: bound maximum wait time for inodegc work Currently inodegc work can sit queued on the per-cpu queue until the workqueue is either flushed of the queue reaches a depth that triggers work queuing (and later throttling). This means that we could queue work that waits for a long time for some other event to trigger flushing. Hence instead of just queueing work at a specific depth, use a delayed work that queues the work at a bound time. We can still schedule the work immediately at a given depth, but we no long need to worry about leaving a number of items on the list that won't get processed until external events prevail. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
d9c61ccb |
|
26-May-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: move xfs_attr_use_log_assist out of xfs_log.c The LARP patchset added an awkward coupling point between libxfs and what would be libxlog, if the XFS log were actually its own library. Move the code that enables logged xattr updates out of "lib"xlog and into xfs_xattr.c so that it no longer has to know about xlog_* functions. While we're at it, give xfs_xattr.c its own header file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
4136e38a |
|
21-May-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: put attr[id] log item cache init with the others Initialize and destroy the xattr log item caches in the same places that we do all the other log item caches. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
973ac0eb |
|
10-Aug-2021 |
Chandan Babu R <chandan.babu@oracle.com> |
xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to the list of supported flags This commit enables XFS module to work with fs instances having 64-bit per-inode extent counters by adding XFS_SB_FEAT_INCOMPAT_NREXT64 flag to the list of supported incompat feature flags. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
|
#
2229276c |
|
11-Apr-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use a separate frextents counter for rt extent reservations As mentioned in the previous commit, the kernel misuses sb_frextents in the incore mount to reflect both incore reservations made by running transactions as well as the actual count of free rt extents on disk. This results in the superblock being written to the log with an underestimate of the number of rt extents that are marked free in the rtbitmap. Teaching XFS to recompute frextents after log recovery avoids operational problems in the current mount, but it doesn't solve the problem of us writing undercounted frextents which are then recovered by an older kernel that doesn't have that fix. Create an incore percpu counter to mirror the ondisk frextents. This new counter will track transaction reservations and the only time we will touch the incore super counter (i.e the one that gets logged) is when those transactions commit updates to the rt bitmap. This is in contrast to the lazysbcount counters (e.g. fdblocks), where we know that log recovery will always fix any incorrect counter that we log. As a bonus, we only take m_sb_lock at transaction commit time. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
70200574 |
|
14-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
block: remove QUEUE_FLAG_DISCARD Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
85bcfa26 |
|
16-Mar-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: don't report reserved bnobt space as available On a modern filesystem, we don't allow userspace to allocate blocks for data storage from the per-AG space reservations, the user-controlled reservation pool that prevents ENOSPC in the middle of internal operations, or the internal per-AG set-aside that prevents unwanted filesystem shutdowns due to ENOSPC during a bmap btree split. Since we now consider freespace btree blocks as unavailable for allocation for data storage, we shouldn't report those blocks via statfs either. This makes the numbers that we return via the statfs f_bavail and f_bfree fields a more conservative estimate of actual free space. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
b97cca3b |
|
03-Feb-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: only bother with sync_filesystem during readonly remount In commit 02b9984d6408, we pushed a sync_filesystem() call from the VFS into xfs_fs_remount. The only time that we ever need to push dirty file data or metadata to disk for a remount is if we're remounting the filesystem read only, so this really could be moved to xfs_remount_ro. Once we've moved the call site, actually check the return value from sync_filesystem. Fixes: 02b9984d6408 ("fs: push sync_filesystem() down to the file system's remount_fs()") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
2d86293c |
|
30-Jan-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: return errors in xfs_fs_sync_fs Now that the VFS will do something with the return values from ->sync_fs, make ours pass on error codes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org>
|
#
5b5abbef |
|
29-Nov-2021 |
Christoph Hellwig <hch@lst.de> |
xfs: move dax device handling into xfs_{alloc,free}_buftarg Hide the DAX device lookup from the xfs_super.c code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20211129102203.2243509-22-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
7b0800d0 |
|
29-Nov-2021 |
Christoph Hellwig <hch@lst.de> |
dax: remove dax_capable Just open code the block size and dax_dev == NULL checks in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs] Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-9-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
679a9949 |
|
29-Nov-2021 |
Christoph Hellwig <hch@lst.de> |
xfs: factor out a xfs_setup_dax_always helper Factor out another DAX setup helper to simplify future changes. Also move the experimental warning after the checks to not clutter the log too much if the setup failed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-8-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
7993f1a4 |
|
15-Dec-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: only run COW extent recovery when there are no live extents As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop: mount <filesystem> # (0) fsstress <run only readonly ops> & # (1) while true; do fsstress <run all ops> mount -o remount,ro # (2) fsstress <run only readonly ops> mount -o remount,rw # (3) done When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A). When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that incore COW forks of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents are allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B) and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount. The results are catastrophic -- file (B) and the refcount btree are now corrupt. In the first patch, we fixed the race condition in (2) so that (A) will always flush the COW fork. In this second patch, we move the _recover_cow call to the initial mount call in (0) for safety. As mentioned previously, xfs_reflink_recover_cow walks the refcount btree looking for COW staging extents, and frees them. This was intended to be run at mount time (when we know there are no live inodes) to clean up any leftover staging events that may have been left behind during an unclean shutdown. As a time "optimization" for readonly mounts, we deferred this to the ro->rw transition, not realizing that any failure to clean all COW forks during a rw->ro transition would result in catastrophic corruption. Therefore, remove this optimization and only run the recovery routine when we're guaranteed not to have any COW staging extents anywhere, which means we always run this at mount time. While we're at it, move the callsite to xfs_log_mount_finish because any refcount btree expansion (however unlikely given that we're removing records from the right side of the index) must be fed by a per-AG reservation, which doesn't exist in its current location. Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
089558bc |
|
06-Dec-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: remove all COW fork extents when remounting readonly As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop: mount <filesystem> # (0) fsstress <run only readonly ops> & # (1) while true; do fsstress <run all ops> mount -o remount,ro # (2) fsstress <run only readonly ops> mount -o remount,rw # (3) done When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A). When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that incore COW forks of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents are allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B) and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount. The results are catastrophic -- file (B) and the refcount btree are now corrupt. Solve this race by forcing the xfs_blockgc_free_space to run synchronously, which causes xfs_icwalk to return to inodes that were skipped because the blockgc code couldn't take the IOLOCK. This is safe to do here because the VFS has already prohibited new writer threads. Fixes: 10ddf64e420f ("xfs: remove leftover CoW reservations when remounting ro") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
|
#
0b9007ec |
|
25-Oct-2021 |
Wan Jiabing <wanjiabing@vivo.com> |
xfs: Remove duplicated include in xfs_super Fix following checkincludes.pl warning: ./fs/xfs/xfs_super.c: xfs_btree.h is included more than once. The include is in line 15. Remove the duplicated here. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
c201d9ca |
|
12-Oct-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: rename xfs_bmap_add_free to xfs_free_extent_later xfs_bmap_add_free isn't a block mapping function; it schedules deferred freeing operations for a later point in a compound transaction chain. While it's primarily used by bunmapi, its use has expanded beyond that. Move it to xfs_alloc.c and rename the function since it's now general freeing functionality. Bring the slab cache bits in line with the way we handle the other intent items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
|
#
f3c799c2 |
|
12-Oct-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: create slab caches for frequently-used deferred items Create slab caches for the high-level structures that coordinate deferred intent items, since they're used fairly heavily. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
|
#
182696fb |
|
12-Oct-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: rename _zone variables to _cache Now that we've gotten rid of the kmem_zone_t typedef, rename the variables to _cache since that's what they are. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
|
#
9fa47bdc |
|
23-Sep-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use separate btree cursor cache for each btree type Now that we have the infrastructure to track the max possible height of each btree type, we can create a separate slab cache for cursors of each type of btree. For smaller indices like the free space btrees, this means that we can pack more cursors into a slab page, improving slab utilization. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
c940a0c5 |
|
16-Sep-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: dynamically allocate cursors based on maxlevels To support future btree code, we need to be able to size btree cursors dynamically for very large btrees. Switch the maxlevels computation to use the precomputed values in the superblock, and create cursors that can handle a certain height. For now, we retain the btree cursor cache that can handle up to 9-level btrees, though a subsequent patch introduces separate caches for each btree type, where each cache's objects will be exactly tall enough to handle the specific btree type. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
6ca444cf |
|
16-Sep-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: prepare xfs_btree_cur for dynamic cursor heights Split out the btree level information into a separate struct and put it at the end of the cursor structure as a VLA. Files with huge data forks (and in the future, the realtime rmap btree) will require the ability to support many more levels than a per-AG btree cursor, which means that we're going to create per-btree type cursor caches to conserve memory for the more common case. Note that a subsequent patch actually introduces dynamic cursor heights. This one merely rearranges the structure to prepare for that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
bdd3c50d |
|
26-Aug-2021 |
Christoph Hellwig <hch@lst.de> |
dax: remove bdev_dax_supported All callers already have a dax_device obtained from fs_dax_get_by_bdev at hand, so just pass that to dax_supported() insted of doing another lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20210826135510.6293-10-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
a384f088 |
|
26-Aug-2021 |
Christoph Hellwig <hch@lst.de> |
xfs: factor out a xfs_buftarg_is_dax helper Refactor the DAX setup code in preparation of removing bdev_dax_supported. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20210826135510.6293-9-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
d6837c1a |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce xfs_sb_is_v5 helper Rather than open coding XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 checks everywhere, add a simple wrapper to encapsulate this and make the code easier to read. This allows us to remove the xfs_sb_version_has_v3inode() wrapper which is only used in xfs_format.h now and is just a version number check. There are a couple of places where we should be checking the mount feature bits rather than the superblock version (e.g. remount), so those are converted to use xfs_has_crc(mp) instead. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
ebd9027d |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert xfs_sb_version_has checks to use mount features This is a conversion of the remaining xfs_sb_version_has..(sbp) checks to use xfs_has_..(mp) feature checks. This was largely done with a vim replacement macro that did: :0,$s/xfs_sb_version_has\(.*\)&\(.*\)->m_sb/xfs_has_\1\2/g<CR> A couple of other variants were also used, and the rest touched up by hand. $ size -t fs/xfs/built-in.a text data bss dec hex filename before 1127533 311352 484 1439369 15f689 (TOTALS) after 1125360 311352 484 1437196 15ee0c (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
75c8c50f |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown Remove the shouty macro and instead use the inline function that matches other state/feature check wrapper naming. This conversion was done with sed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
2e973b2c |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert remaining mount flags to state flags The remaining mount flags kept in m_flags are actually runtime state flags. These change dynamically, so they really should be updated atomically so we don't potentially lose an update due to racing modifications. Convert these remaining flags to be stored in m_opstate and use atomic bitops to set and clear the flags. This also adds a couple of simple wrappers for common state checks - read only and shutdown. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
0560f31a |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert mount flags to features Replace m_flags feature checks with xfs_has_<feature>() calls and rework the setup code to set flags in m_features. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
38c26bfd |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: replace xfs_sb_version checks with feature flag checks Convert the xfs_sb_version_hasfoo() to checks against mp->m_features. Checks of the superblock itself during disk operations (e.g. in the read/write verifiers and the to/from disk formatters) are not converted - they operate purely on the superblock state. Everything else should use the mount features. Large parts of this conversion were done with sed with commands like this: for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f done With manual cleanups for things like "xfs_has_extflgbit" and other little inconsistencies in naming. The result is ia lot less typing to check features and an XFS binary size reduced by a bit over 3kB: $ size -t fs/xfs/built-in.a text data bss dec hex filenam before 1130866 311352 484 1442702 16038e (TOTALS) after 1127727 311352 484 1439563 15f74b (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
e23b55d5 |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: rework attr2 feature and mount options The attr2 feature is somewhat unique in that it has both a superblock feature bit to enable it and mount options to enable and disable it. Back when it was first introduced in 2005, attr2 was disabled unless either the attr2 superblock feature bit was set, or the attr2 mount option was set. If the superblock feature bit was not set but the mount option was set, then when the first attr2 format inode fork was created, it would set the superblock feature bit. This is as it should be - the superblock feature bit indicated the presence of the attr2 on disk format. The noattr2 mount option, however, did not affect the superblock feature bit. If noattr2 was specified, the on-disk superblock feature bit was ignored and the code always just created attr1 format inode forks. If neither of the attr2 or noattr2 mounts option were specified, then the behaviour was determined by the superblock feature bit. This was all pretty sane. Fast foward 3 years, and we are dealing with fallout from the botched sb_features2 addition and having to deal with feature mismatches between the sb_features2 and sb_bad_features2 fields. The attr2 feature bit was one of these flags. The reconciliation was done well after mount option parsing and, unfortunately, the feature reconciliation had a bug where it ignored the noattr2 mount option. For reasons lost to the mists of time, it was decided that resolving this issue in commit 7c12f296500e ("[XFS] Fix up noattr2 so that it will properly update the versionnum and features2 fields.") required noattr2 to clear the superblock attr2 feature bit. This greatly complicated the attr2 behaviour and broke rules about feature bits needing to be set when those specific features are present in the filesystem. By complicated, I mean that it introduced problems due to feature bit interactions with log recovery. All of the superblock feature bit checks are done prior to log recovery, but if we crash after removing a feature bit, then on the next mount we see the feature bit in the unrecovered superblock, only to have it go away after the log has been replayed. This means our mount time feature processing could be all wrong. Hence you can mount with noattr2, crash shortly afterwards, and mount again without attr2 or noattr2 and still have attr2 enabled because the second mount sees attr2 still enabled in the superblock before recovery runs and removes the feature bit. It's just a mess. Further, this is all legacy code as the v5 format requires attr2 to be enabled at all times and it cannot be disabled. i.e. the noattr2 mount option returns an error when used on v5 format filesystems. To straighten this all out, this patch reverts the attr2/noattr2 mount option behaviour back to the original behaviour. There is no reason for disabling attr2 these days, so we will only do this when the noattr2 mount option is set. This will not remove the superblock feature bit. The superblock bit will provide the default behaviour and only track whether attr2 is present on disk or not. The attr2 mount option will enable the creation of attr2 format inode forks, and if the superblock feature bit is not set it will be added when the first attr2 inode fork is created. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
33c0dd78 |
|
10-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: move the CIL workqueue to the CIL We only use the CIL workqueue in the CIL, so it makes no sense to hang it off the xfs_mount and have to walk multiple pointers back up to the mount when we have the CIL structures right there. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
39823d0f |
|
10-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: CIL work is serialised, not pipelined Because we use a single work structure attached to the CIL rather than the CIL context, we can only queue a single work item at a time. This results in the CIL being single threaded and limits performance when it becomes CPU bound. The design of the CIL is that it is pipelined and multiple commits can be running concurrently, but the way the work is currently implemented means that it is not pipelining as it was intended. The critical work to switch the CIL context can take a few milliseconds to run, but the rest of the CIL context flush can take hundreds of milliseconds to complete. The context switching is the serialisation point of the CIL, once the context has been switched the rest of the context push can run asynchrnously with all other context pushes. Hence we can move the work to the CIL context so that we can run multiple CIL pushes at the same time and spread the majority of the work out over multiple CPUs. We can keep the per-cpu CIL commit state on the CIL rather than the context, because the context is pinned to the CIL until the switch is done and we aggregate and drain the per-cpu state held on the CIL during the context switch. However, because we no longer serialise the CIL work, we can have effectively unlimited CIL pushes in progress. We don't want to do this - not only does it create contention on the iclogs and the state machine locks, we can run the log right out of space with outstanding pushes. Instead, limit the work concurrency to 4 concurrent works being processed at a time. This is enough concurrency to remove the CIL from being a CPU bound bottleneck but not enough to create new contention points or unbound concurrency issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
e1d06e5f |
|
10-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert log flags to an operational state field log->l_flags doesn't actually contain "flags" as such, it contains operational state information that can change at runtime. For the shutdown state, this at least should be an atomic bit because it is read without holding locks in many places and so using atomic bitops for the state field modifications makes sense. This allows us to use things like test_and_set_bit() on state changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the state when we aren't holding locks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
f19ee6bb |
|
06-Aug-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: drop experimental warnings for bigtime and inobtcount These two features were merged a year ago, userspace tooling have been merged, and no serious errors have been reported by the developers. Drop the experimental tag to encourage wider testing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
6f649091 |
|
06-Aug-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: don't run speculative preallocation gc when fs is frozen Now that we have the infrastructure to switch background workers on and off at will, fix the block gc worker code so that we don't actually run the worker when the filesystem is frozen, same as we do for deferred inactivation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
01e8f379 |
|
06-Aug-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: flush inode inactivation work when compiling usage statistics Users have come to expect that the space accounting information in statfs and getquota reports are fairly accurate. Now that we inactivate inodes from a background queue, these numbers can be thrown off by whatever resources are singly-owned by the inodes in the queue. Flush the pending inactivations when userspace asks for a space usage report. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
ab23a776 |
|
06-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: per-cpu deferred inode inactivation queues Move inode inactivation to background work contexts so that it no longer runs in the context that releases the final reference to an inode. This will allow process work that ends up blocking on inactivation to continue doing work while the filesytem processes the inactivation in the background. A typical demonstration of this is unlinking an inode with lots of extents. The extents are removed during inactivation, so this blocks the process that unlinked the inode from the directory structure. By moving the inactivation to the background process, the userspace applicaiton can keep working (e.g. unlinking the next inode in the directory) while the inactivation work on the previous inode is done by a different CPU. The implementation of the queue is relatively simple. We use a per-cpu lockless linked list (llist) to queue inodes for inactivation without requiring serialisation mechanisms, and a work item to allow the queue to be processed by a CPU bound worker thread. We also keep a count of the queue depth so that we can trigger work after a number of deferred inactivations have been queued. The use of a bound workqueue with a single work depth allows the workqueue to run one work item per CPU. We queue the work item on the CPU we are currently running on, and so this essentially gives us affine per-cpu worker threads for the per-cpu queues. THis maintains the effective CPU affinity that occurs within XFS at the AG level due to all objects in a directory being local to an AG. Hence inactivation work tends to run on the same CPU that last accessed all the objects that inactivation accesses and this maintains hot CPU caches for unlink workloads. A depth of 32 inodes was chosen to match the number of inodes in an inode cluster buffer. This hopefully allows sequential allocation/unlink behaviours to defering inactivation of all the inodes in a single cluster buffer at a time, further helping maintain hot CPU and buffer cache accesses while running inactivations. A hard per-cpu queue throttle of 256 inode has been set to avoid runaway queuing when inodes that take a long to time inactivate are being processed. For example, when unlinking inodes with large numbers of extents that can take a lot of processing to free. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: tweak comments and tracepoints, convert opflags to state bits] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
c6c2066d |
|
06-Aug-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: move xfs_inactive call to xfs_inode_mark_reclaimable Move the xfs_inactive call and all the other debugging checks and stats updates into xfs_inode_mark_reclaimable because most of that are implementation details about the inode cache. This is preparation for deferred inactivation that is coming up. We also move it around xfs_icache.c in preparation for deferred inactivation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
0ed17f01 |
|
06-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce all-mounts list for cpu hotplug notifications The inode inactivation and CIL tracking percpu structures are per-xfs_mount structures. That means when we get a CPU dead notification, we need to then iterate all the per-cpu structure instances to process them. Rather than keeping linked lists of per-cpu structures in each subsystem, add a list of all xfs_mounts that the generic xfs_cpu_dead() function will iterate and call into each subsystem appropriately. This allows us to handle both per-mount and global XFS percpu state from xfs_cpu_dead(), and avoids the need to link subsystem structures that can be easily found from the xfs_mount into their own global lists. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: expand some comments about mount list setup ordering rules] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
f1653c2e |
|
06-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce CPU hotplug infrastructure We need to move to per-cpu state for both deferred inode inactivation and CIL tracking, but to do that we need to handle CPUs being removed from the system by the hot-plug code. Introduce generic XFS infrastructure to handle CPU hotplug events that is set up at module init time and torn down at module exit time. Initially, we only need CPU dead notifications, so we only set up a callback for these notifications. The infrastructure can be updated in future for other CPU hotplug state machine notifications easily if ever needed. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: rearrange some macros, fix function prototypes] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
149e53af |
|
06-Aug-2021 |
Christoph Hellwig <hch@lst.de> |
xfs: remove the active vs running quota differentiation These only made a difference when quotaoff supported disabling quota accounting on a mounted file system, so we can switch everyone to use a single set of flags and helpers now. Note that the *QUOTA_ON naming for the helpers is kept as it was the much more commonly used one. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
2433480a |
|
12-Apr-2021 |
Jan Kara <jack@suse.cz> |
xfs: Convert to use invalidate_lock Use invalidate_lock instead of XFS internal i_mmap_lock. The intended purpose of invalidate_lock is exactly the same. Note that the locking in __xfs_filemap_fault() slightly changes as filemap_fault() already takes invalidate_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> CC: <linux-xfs@vger.kernel.org> CC: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
b5071ada |
|
18-Jun-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove xfs_blkdev_issue_flush It's a one line wrapper around blkdev_issue_flush(). Just replace it with direct calls to blkdev_issue_flush(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
c076ae7a |
|
31-May-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: refactor per-AG inode tagging functions In preparation for adding another incore inode tree tag, refactor the code that sets and clears tags from the per-AG inode tree and the tree of per-AG structures, and remove the open-coded versions used by the blockgc code. Note: For reclaim, we now rely on the radix tree tags instead of the reclaimable inode count more heavily than we used to. The conversion should be fine, but the logic isn't 100% identical. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
9bbafc71 |
|
01-Jun-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: move xfs_perag_get/put to xfs_ag.[ch] They are AG functions, not superblock functions, so move them to the appropriate location. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
db07349d |
|
29-Mar-2021 |
Christoph Hellwig <hch@lst.de> |
xfs: move the di_flags field to struct xfs_inode In preparation of removing the historic icinode struct, move the flags field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
25dfa65f |
|
25-Mar-2021 |
Anthony Iliopoulos <ailiop@suse.com> |
xfs: fix xfs_trans slab cache name Removal of kmem_zone_init wrappers accidentally changed a slab cache name from "xfs_trans" to "xf_trans". Fix this so that userspace consumers of /proc/slabinfo and /sys/kernel/slab can find it again. Fixes: b1231760e443 ("xfs: Remove slab init wrappers") Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
92cf7d36 |
|
22-Mar-2021 |
Pavel Reichl <preichl@redhat.com> |
xfs: Skip repetitive warnings about mount options Skip the warnings about mount option being deprecated if we are remounting and deprecated option state is not changing. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211605 Fix-suggested-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
0f98b4ec |
|
22-Mar-2021 |
Pavel Reichl <preichl@redhat.com> |
xfs: rename variable mp to parsing_mp Rename mp variable to parsisng_mp so it is easy to distinguish between current mount point handle and handle for mount point which mount options are being parsed. Suggested-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
3fef46fc |
|
22-Mar-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: rename the blockgc workqueue Since we're about to start using the blockgc workqueue to dispose of inactivated inodes, strip the "block" prefix from the name; now it's merely the general garbage collection (gc) workqueue. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
f736d93d |
|
21-Jan-2021 |
Christoph Hellwig <hch@lst.de> |
xfs: support idmapped mounts Enable idmapped mounts for xfs. This basically just means passing down the user_namespace argument from the VFS methods down to where it is passed to the relevant helpers. Note that full-filesystem bulkstat is not supported from inside idmapped mounts as it is an administrative operation that acts on the whole file system. The limitation is not applied to the bulkstat single operation that just operates on a single inode. Link: https://lore.kernel.org/r/20210121131959.646623-40-christian.brauner@ubuntu.com Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
#
47bd6d34 |
|
25-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: expose the blockgc workqueue knobs publicly Expose the workqueue sysfs knobs for the speculative preallocation gc workers on all kernels, and update the sysadmin information. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
894ecacf |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: parallelize block preallocation garbage collection Split the block preallocation garbage collection work into per-AG work items so that we can take advantage of parallelization. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
c9a6526f |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: rename block gc start and stop functions Shorten the names of the two functions that start and stop block preallocation garbage collection and move them up to the other blockgc functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
9669f51d |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: consolidate the eofblocks and cowblocks workers Remove the separate cowblocks work items and knob so that we can control and run everything from a single blockgc work queue. Note that the speculative_prealloc_lifetime sysfs knob retains its historical name even though the functions move to prefix xfs_blockgc_*. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
b943c0cd |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: hide xfs_icache_free_cowblocks Change the one remaining caller of xfs_icache_free_cowblocks to use our new combined blockgc scan function instead, since we will soon be combining the two scans. This introduces a slight behavior change, since a readonly remount now clears out post-EOF preallocations and not just CoW staging extents. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
05a302a1 |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: set WQ_SYSFS on all workqueues in debug mode When CONFIG_XFS_DEBUG=y, set WQ_SYSFS on all workqueues that we create so that we (developers) have a means to monitor cpu affinity and whatnot for background workers. In the next patchset we'll expose knobs for more of the workqueues publicly and document it, but not now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
c6bf3f0e |
|
26-Jan-2021 |
Christoph Hellwig <hch@lst.de> |
block: use an on-stack bio in blkdev_issue_flush There is no point in allocating memory for a synchronous flush. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
5b0ad7c2 |
|
22-Jan-2021 |
Brian Foster <bfoster@redhat.com> |
xfs: cover the log on freeze instead of cleaning it Filesystem freeze cleans the log and immediately redirties it so log recovery runs if a crash occurs after the filesystem is frozen. Now that log quiesce covers the log, there is no need to clean the log and redirty it to trigger log recovery because covering has the same effect. Update xfs_fs_freeze() to quiesce (and thus cover) the log. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
|
#
ea2064da |
|
22-Jan-2021 |
Brian Foster <bfoster@redhat.com> |
xfs: remove xfs_quiesce_attr() xfs_quiesce_attr() is now a wrapper for xfs_log_clean(). Remove it and call xfs_log_clean() directly. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
|
#
5232b931 |
|
22-Jan-2021 |
Brian Foster <bfoster@redhat.com> |
xfs: remove duplicate wq cancel and log force from attr quiesce These two calls are repeated at the beginning of xfs_log_quiesce(). Drop them from xfs_quiesce_attr(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
|
#
f46e5a17 |
|
22-Jan-2021 |
Brian Foster <bfoster@redhat.com> |
xfs: fold sbcount quiesce logging into log covering xfs_log_sbcount() calls xfs_sync_sb() to sync superblock counters to disk when lazy superblock accounting is enabled. This occurs on unmount, freeze, and read-only (re)mount and ensures the final values are calculated and persisted to disk before each form of quiesce completes. Now that log covering occurs in all of these contexts and uses the same xfs_sync_sb() mechanism to update log state, there is no need to log the superblock separately for any reason. Update the log quiesce path to sync the superblock at least once for any mount where lazy superblock accounting is enabled. If the log is already covered, it will remain in the covered state. Otherwise, the next sync as part of the normal covering sequence will carry the associated superblock update with it. Remove xfs_log_sbcount() now that it is no longer needed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
|
#
9e54ee0f |
|
22-Jan-2021 |
Brian Foster <bfoster@redhat.com> |
xfs: separate log cleaning from log quiesce Log quiesce is currently associated with cleaning the log, which is accomplished by writing an unmount record as the last step of the quiesce sequence. The quiesce codepath is a bit convoluted in this regard due to how it is reused from various contexts. In preparation to create separate log cleaning and log covering interfaces, lift the write of the unmount record into a new cleaning helper and call that wherever xfs_log_quiesce() is currently invoked. No functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
237d7887 |
|
03-Dec-2020 |
Kaixu Xia <kaixuxia@tencent.com> |
xfs: show the proper user quota options The quota option 'usrquota' should be shown if both the XFS_UQUOTA_ACCT and XFS_UQUOTA_ENFD flags are set. The option 'uqnoenforce' should be shown when only the XFS_UQUOTA_ACCT flag is set. The current code logic seems wrong, Fix it and show proper options. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
1e5c39df |
|
04-Dec-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: rename xfs_fc_* back to xfs_fs_* Get rid of this one-off namespace since we're done converting things to fscontext now. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
33005fd0 |
|
04-Dec-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: refactor file range validation Refactor all the open-coded validation of file block ranges into a single helper, and teach the bmap scrubber to check the ranges. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
80c720b8 |
|
24-Nov-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: define a new "needrepair" feature Define an incompat feature flag to indicate that the filesystem needs to be repaired. While libxfs will recognize this feature, the kernel will refuse to mount if the feature flag is set, and only xfs_repair will be able to clear the flag. The goal here is to force the admin to run xfs_repair to completion after upgrading the filesystem, or if we otherwise detect anomalies. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
#
3945ae03 |
|
24-Nov-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: move kernel-specific superblock validation out of libxfs A couple of the superblock validation checks apply only to the kernel, so move them to xfs_fc_fill_super before we add the needsrepair "feature", which will prevent the kernel (but not xfsprogs) from mounting the filesystem. This also reduces the diff between kernel and userspace libxfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
#
c23c393e |
|
25-Sep-2020 |
Pavel Reichl <preichl@redhat.com> |
xfs: remove deprecated mount options ikeep/noikeep was a workaround for old DMAPI code which is no longer relevant. attr2/noattr2 - is for controlling upgrade behaviour from fixed attribute fork sizes in the inode (attr1) and dynamic attribute fork sizes (attr2). mkfs has defaulted to setting attr2 since 2007, hence just about every XFS filesystem out there in production right now uses attr2. Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix minor typos] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
6d1349c7 |
|
18-Sep-2020 |
Al Viro <viro@zeniv.linux.org.uk> |
[PATCH] reduce boilerplate in fsid handling Get rid of boilerplate in most of ->statfs() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
b96cb835 |
|
10-Sep-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: deprecate the V4 format The V4 filesystem format contains known weaknesses in the on-disk format that make metadata verification diffiult. In addition, the format does not support dates past 2038 and will not be upgraded to do so. We should start the process of retiring the old format to close off attack surfaces and to encourage users to migrate onto V5. Therefore, make XFS V4 support a configurable option. For the first period it will be default Y in case some distributors want to withdraw support early; for the second period it will be default N so that anyone who wishes to continue support can do so; and after that, support will be removed from the kernel. Dates for these events have been added to the upstream kernel. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
#
06dbf82b |
|
24-Aug-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: trace timestamp limits Add a couple of tracepoints so that we can check the timestamp limits being set on inodes and quotas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
f93e5436 |
|
17-Aug-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: widen ondisk inode timestamps to deal with y2038+ Redesign the ondisk inode timestamps to be a simple unsigned 64-bit counter of nanoseconds since 14 Dec 1901 (i.e. the minimum time in the 32-bit unix time epoch). This enables us to handle dates up to 2486, which solves the y2038 problem. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
876fdc7c |
|
17-Aug-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: explicitly define inode timestamp range Formally define the inode timestamp ranges that existing filesystems support, and switch the vfs timetamp ranges to use it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
2a39946c |
|
17-Aug-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: store inode btree block counts in AGI header Add a btree block usage counters for both inode btrees to the AGI header so that we don't have to walk the entire finobt at mount time to create the per-AG reservations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
718ecc50 |
|
17-Aug-2020 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_iflock is no longer a completion With the recent rework of the inode cluster flushing, we no longer ever wait on the the inode flush "lock". It was never a lock in the first place, just a completion to allow callers to wait for inode IO to complete. We now never wait for flush completion as all inode flushing is non-blocking. Hence we can get rid of all the iflock infrastructure and instead just set and check a state flag. Rename the XFS_IFLOCK flag to XFS_IFLUSHING, convert all the xfs_iflock_nowait() test-and-set operations on that flag, and replace all the xfs_ifunlock() calls to clear operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
4750a171 |
|
15-Jul-2020 |
Eric Sandeen <sandeen@redhat.com> |
xfs: preserve inode versioning across remounts The MS_I_VERSION mount flag is exposed via the VFS, as documented in the mount manpages etc; see the iversion and noiversion mount options in mount(8). As a result, mount -o remount looks for this option in /proc/mounts and will only send the I_VERSION flag back in during remount it it is present. Since it's not there, a remount will /remove/ the I_VERSION flag at the vfs level, and iversion functionality is lost. xfs v5 superblocks intend to always have i_version enabled; it is set as a default at mount time, but is lost during remount for the reasons above. The generic fix would be to expose this documented option in /proc/mounts, but since that was rejected, fix it up again in the xfs remount path instead, so that at least xfs won't suffer from this misbehavior. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
c3f2375b |
|
08-Jul-2020 |
Waiman Long <longman@redhat.com> |
xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim Depending on the workloads, the following circular locking dependency warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo lock) may show up: ====================================================== WARNING: possible circular locking dependency detected 5.0.0-rc1+ #60 Tainted: G W ------------------------------------------------------ fsfreeze/4346 is trying to acquire lock: 0000000026f1d784 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part.19+0x5/0x30 but task is already holding lock: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 which lock already depends on the new lock. : Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_internal); lock(fs_reclaim); lock(sb_internal); lock(fs_reclaim); *** DEADLOCK *** 4 locks held by fsfreeze/4346: #0: 00000000b478ef56 (sb_writers#8){++++}, at: percpu_down_write+0xb4/0x650 #1: 000000001ec487a9 (&type->s_umount_key#28){++++}, at: freeze_super+0xda/0x290 #2: 000000003edbd5a0 (sb_pagefaults){++++}, at: percpu_down_write+0xb4/0x650 #3: 0000000072bfc54b (sb_internal){++++}, at: percpu_down_write+0xb4/0x650 stack backtrace: Call Trace: dump_stack+0xe0/0x19a print_circular_bug.isra.10.cold.34+0x2f4/0x435 check_prev_add.constprop.19+0xca1/0x15f0 validate_chain.isra.14+0x11af/0x3b50 __lock_acquire+0x728/0x1200 lock_acquire+0x269/0x5a0 fs_reclaim_acquire.part.19+0x29/0x30 fs_reclaim_acquire+0x19/0x20 kmem_cache_alloc+0x3e/0x3f0 kmem_zone_alloc+0x79/0x150 xfs_trans_alloc+0xfa/0x9d0 xfs_sync_sb+0x86/0x170 xfs_log_sbcount+0x10f/0x140 xfs_quiesce_attr+0x134/0x270 xfs_fs_freeze+0x4a/0x70 freeze_super+0x1af/0x290 do_vfs_ioctl+0xedc/0x16c0 ksys_ioctl+0x41/0x80 __x64_sys_ioctl+0x73/0xa9 do_syscall_64+0x18f/0xd23 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is a false positive as all the dirty pages are flushed out before the filesystem can be frozen. One way to avoid this splat is to add GFP_NOFS to the affected allocation calls by using the memalloc_nofs_save()/memalloc_nofs_restore() pair. This shouldn't matter unless the system is really running out of memory. In that particular case, the filesystem freeze operation may fail while it was succeeding previously. Without this patch, the command sequence below will show that the lock dependency chain sb_internal -> fs_reclaim exists. # fsfreeze -f /home # fsfreeze --unfreeze /home # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal After applying the patch, such sb_internal -> fs_reclaim lock dependency chain can no longer be found. Because of that, the locking dependency warning will not be shown. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
4d0bab3a |
|
01-Jul-2020 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove SYNC_WAIT from xfs_reclaim_inodes() Clean up xfs_reclaim_inodes() callers. Most callers want blocking behaviour, so just make the existing SYNC_WAIT behaviour the default. For the xfs_reclaim_worker(), just call xfs_reclaim_inodes_ag() directly because we just want optimistic clean inode reclaim to be done in the background. For xfs_quiesce_attr() we can just remove the inode reclaim calls as they are a historic relic that was required to flush dirty inodes that contained unlogged changes. We now log all changes to the inodes, so the sync AIL push from xfs_log_quiesce() called by xfs_quiesce_attr() will do all the required inode writeback for freeze. Seeing as we now want to loop until all reclaimable inodes have been reclaimed, make xfs_reclaim_inodes() loop on the XFS_ICI_RECLAIM_TAG tag rather than having xfs_reclaim_inodes_ag() tell it that inodes were skipped. This is much more reliable and will always loop until all reclaimable inodes are reclaimed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
02beb268 |
|
30-Apr-2020 |
Ira Weiny <ira.weiny@intel.com> |
fs/xfs: Make DAX mount option a tri-state As agreed upon[1]. We make the dax mount option a tri-state. '-o dax' continues to operate the same. We add 'always', 'never', and 'inode' (default). [1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/ Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
fd58c62b |
|
30-Apr-2020 |
Ira Weiny <ira.weiny@intel.com> |
fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS In prep for the new tri-state mount option which then introduces XFS_MOUNT_DAX_NEVER. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
b41b46c2 |
|
20-May-2020 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove the m_active_trans counter It's a global atomic counter, and we are hitting it at a rate of half a million transactions a second, so it's bouncing the counter cacheline all over the place on large machines. We don't actually need it anymore - it used to be required because the VFS freeze code could not track/prevent filesystem transactions that were running, but that problem no longer exists. Hence to remove the counter, we simply have to ensure that nothing calls xfs_sync_sb() while we are trying to quiesce the filesytem. That only happens if the log worker is still running when we call xfs_quiesce_attr(). The log worker is cancelled at the end of xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it early here and then we can remove the counter altogether. Concurrent create, 50 million inodes, identical 16p/16GB virtual machines on different physical hosts. Machine A has twice the CPU cores per socket of machine B: unpatched patched machine A: 3m16s 2m00s machine B: 4m04s 4m05s Create rates: unpatched patched machine A: 282k+/-31k 468k+/-21k machine B: 231k+/-8k 233k+/-11k Concurrent rm of same 50 million inodes: unpatched patched machine A: 6m42s 2m33s machine B: 4m47s 4m47s The transaction rate on the fast machine went from just under 300k/sec to 700k/sec, which indicates just how much of a bottleneck this atomic counter was. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
9398554f |
|
13-May-2020 |
Christoph Hellwig <hch@lst.de> |
block: remove the error_sector argument to blkdev_issue_flush The argument isn't used by any caller, and drivers don't fill out bi_sector for flush requests either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
237aac46 |
|
12-May-2020 |
Zheng Bin <zhengbin13@huawei.com> |
xfs: ensure f_bfree returned by statfs() is non-negative Construct an img like this: dd if=/dev/zero of=xfs.img bs=1M count=20 mkfs.xfs -d agcount=1 xfs.img xfs_db -x xfs.img sb 0 write fdblocks 0 agf 0 write freeblks 0 write longest 0 quit mount it, df -h /mnt(xfs mount point), will show this: Filesystem Size Used Avail Use% Mounted on /dev/loop0 17M -64Z -32K 100% /mnt Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
dae2f8ed |
|
30-Apr-2020 |
Ira Weiny <ira.weiny@intel.com> |
fs: Lift XFS_IDONTCACHE to the VFS layer DAX effective mode (S_DAX) changes requires inode eviction. XFS has an advisory flag (XFS_IDONTCACHE) to prevent caching of the inode if no other additional references are taken. We lift this flag to the VFS layer and change the behavior slightly by allowing the flag to remain even if multiple references are taken. This will expedite the eviction of inodes to change S_DAX. Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
8d6c3446 |
|
04-May-2020 |
Ira Weiny <ira.weiny@intel.com> |
fs/xfs: Make DAX mount option a tri-state As agreed upon[1]. We make the dax mount option a tri-state. '-o dax' continues to operate the same. We add 'always', 'never', and 'inode' (default). [1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/ Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
606723d9 |
|
04-May-2020 |
Ira Weiny <ira.weiny@intel.com> |
fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS In prep for the new tri-state mount option which then introduces XFS_MOUNT_DAX_NEVER. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
f0f7a674 |
|
12-Apr-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: move inode flush to the sync workqueue Move the inode dirty data flushing to a workqueue so that multiple threads can take advantage of a single thread's flushing work. The ratelimiting technique used in bdd4ee4 was not successful, because threads that skipped the inode flush scan due to ratelimiting would ENOSPC early, which caused occasional (but noticeable) changes in behavior and sporadic fstest regressions. Therefore, make all the writer threads wait on a single inode flush, which eliminates both the stampeding hordes of flushers and the small window in which a write could fail with ENOSPC because it lost the ratelimit race after even another thread freed space. Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
c6425702 |
|
27-Mar-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: ratelimit inode flush on buffered write ENOSPC A customer reported rcu stalls and softlockup warnings on a computer with many CPU cores and many many more IO threads trying to write to a filesystem that is totally out of space. Subsequent analysis pointed to the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb, which causes a lot of wb_writeback_work to be queued. The writeback worker spends so much time trying to wake the many many threads waiting for writeback completion that it trips the softlockup detector, and (in this case) the system automatically reboots. In addition, they complain that the lengthy xfs_flush_inodes scan traps all of those threads in uninterruptible sleep, which hampers their ability to kill the program or do anything else to escape the situation. If there's thousands of threads trying to write to files on a full filesystem, each of those threads will start separate copies of the inode flush scan. This is kind of pointless since we only need one scan, so rate limit the inode flush. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
d59eadae |
|
24-Mar-2020 |
Dave Chinner <dchinner@redhat.com> |
xfs: correctly acount for reclaimable slabs The XFS inode item slab actually reclaimed by inode shrinker callbacks from the memory reclaim subsystem. These should be marked as reclaimable so the mm subsystem has the full picture of how much memory it can actually reclaim from the XFS slab caches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
d7167b14 |
|
07-Sep-2019 |
Al Viro <viro@zeniv.linux.org.uk> |
fs_parse: fold fs_parameter_desc/fs_parameter_spec The former contains nothing but a pointer to an array of the latter... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
96cafb9c |
|
06-Dec-2019 |
Eric Sandeen <sandeen@sandeen.net> |
fs_parser: remove fs_parameter_description name field Unused now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
932befe3 |
|
02-Jan-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: fix s_maxbytes computation on 32-bit kernels I observed a hang in generic/308 while running fstests on a i686 kernel. The hang occurred when trying to purge the pagecache on a large sparse file that had a page created past MAX_LFS_FILESIZE, which caused an integer overflow in the pagecache xarray and resulted in an infinite loop. I then noticed that Linus changed the definition of MAX_LFS_FILESIZE in commit 0cc3b0ec23ce ("Clarify (and fix) MAX_LFS_FILESIZE macros") so that it is now one page short of the maximum page index on 32-bit kernels. Because the XFS function to compute max offset open-codes the 2005-era MAX_LFS_FILESIZE computation and neither the vfs nor mm perform any sanity checking of s_maxbytes, the code in generic/308 can create a page above the pagecache's limit and kaboom. Fix all this by setting s_maxbytes to MAX_LFS_FILESIZE directly and aborting the mount with a warning if our assumptions ever break. I have no answer for why this seems to have been broken for years and nobody noticed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
aaf54eb8 |
|
14-Nov-2019 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: Remove kmem_zone_destroy() wrapper Use kmem_cache_destroy directly Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
b1231760 |
|
14-Nov-2019 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: Remove slab init wrappers Remove kmem_zone_init() and kmem_zone_init_flags() together with their specific KM_* to SLAB_* flag wrappers. Use kmem_cache_create() directly. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
7f6bcf7c |
|
08-Nov-2019 |
Dan Carpenter <dan.carpenter@oracle.com> |
xfs: remove a stray tab in xfs_remount_rw() The extra tab makes the code slightly confusing. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ian Kent <raven@themaw.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
0279c71f |
|
06-Nov-2019 |
Colin Ian King <colin.king@canonical.com> |
xfs: remove redundant assignment to variable error Variable error is being initialized with a value that is never read and is being re-assigned a couple of statements later on. The assignment is redundant and hence can be removed. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
50f83009 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: fold xfs_mount-alloc() into xfs_init_fs_context() After switching to use the mount-api the only remaining caller of xfs_mount_alloc() is xfs_init_fs_context(), so fold xfs_mount_alloc() into it. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
8757c38f |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: move xfs_fc_parse_param() above xfs_fc_get_tree() Grouping the options parsing and mount handling functions above the struct fs_context_operations but below the struct super_operations should improve (some) the grouping of the super operations while also improving the grouping of the options parsing and mount handling code. Lastly move xfs_fc_parse_param() and related functions down to above xfs_fc_get_tree() and it's related functions. But leave the options enum, struct fs_parameter_spec and the struct fs_parameter_description declarations at the top since that's the logical place for them. This is a straight code move, there aren't any functional changes. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
2f8d66b3 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: move xfs_fc_get_tree() above xfs_fc_reconfigure() Grouping the options parsing and mount handling functions above the struct fs_context_operations but below the struct super_operations should improve (some) the grouping of the super operations while also improving the grouping of the options parsing and mount handling code. Now move xfs_fc_get_tree() and friends, also take the oppertunity to change STATIC to static for the xfs_fs_put_super() function. This is a straight code move, there aren't any functional changes. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
63cd1e9b |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: move xfs_fc_reconfigure() above xfs_fc_free() Grouping the options parsing and mount handling functions above the struct fs_context_operations but below the struct super_operations should improve (some) the grouping of the super operations while also improving the grouping of the options parsing and mount handling code. Start by moving xfs_fc_reconfigure() and friends. This is a straight code move, there aren't any functional changes. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
73e5fff9 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: switch to use the new mount-api Define the struct fs_parameter_spec table that's used by the new mount-api for options parsing. Create the various fs context operations methods and define the fs_context_operations struct. Create the fs context initialization method and update the struct file_system_type to utilize it. The initialization function is responsible for working storage initialization, allocation and initialization of file system private information storage and for setting the operations in the fs context. Also set struct file_system_type .parameters to the newly defined struct fs_parameter_spec options parsing table for use by the fs context methods and remove unused code. [darrick: add a comment pointing out the one place where mp->m_super is null] Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
7c89fcb2 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: dont set sb in xfs_mount_alloc() When changing to use the new mount api the super block won't be available when the xfs_mount struct is allocated so move setting the super block in xfs_mount to xfs_fs_fill_super(). Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
9a861816 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: move xfs_parseargs() validation to a helper Move the validation code of xfs_parseargs() into a helper for later use within the mount context methods. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
48a06e1b |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: refactor xfs_parseags() Refactor xfs_parseags(), move the entire token case block to a separate function in an attempt to highlight the code that actually changes in converting to use the new mount api. Also change the break in the switch to a return in the factored out xfs_fc_parse_param() function. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
846410cc |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: avoid redundant checks when options is empty When options passed to xfs_parseargs() is NULL the checks performed after taking the branch are made with the initial values of dsunit, dswidth and iosizelog. But all the checks do nothing in this case so return immediately instead. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
c0a67916 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: refactor suffix_kstrtoint() The mount-api doesn't have a "human unit" parse type yet so the options that have values like "10k" etc. still need to be converted by the fs. But the value comes to the fs as a string (not a substring_t type) so there's a need to change the conversion function to take a character string instead. When xfs is switched to use the new mount-api match_kstrtoint() will no longer be used and will be removed. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
2c6eba31 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: add xfs_remount_ro() helper Factor the remount read only code into a helper to simplify the subsequent change from the super block method .remount_fs to the mount-api fs_context_operations method .reconfigure. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
82332b6d |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: add xfs_remount_rw() helper Factor the remount read write code into a helper to simplify the subsequent change from the super block method .remount_fs to the mount-api fs_context_operations method .reconfigure. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
a943f372 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: merge freeing of mp names and mp In all cases when struct xfs_mount (mp) fields m_rtname and m_logname are freed mp is also freed, so merge these into a single function xfs_mount_free() Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
7b77b46a |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: use kmem functions for struct xfs_mount The remount function uses the kmem functions for allocating and freeing struct xfs_mount, for consistency use the kmem functions everwhere for struct xfs_mount. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
3d9d60d9 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: dont use XFS_IS_QUOTA_RUNNING() for option check When CONFIG_XFS_QUOTA is not defined any quota option is invalid. Using the macro XFS_IS_QUOTA_RUNNING() as a check if any quota option has been given is a little misleading so use a simple m_qflags != 0 check to make the intended use more explicit. Also change to use the IS_ENABLED() macro for the kernel config check. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
e1d3d218 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: use super s_id instead of struct xfs_mount m_fsname Eliminate struct xfs_mount field m_fsname by using the super block s_id field directly. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
f676c750 |
|
04-Nov-2019 |
Ian Kent <raven@themaw.net> |
xfs: remove unused struct xfs_mount field m_fsname_len The struct xfs_mount field m_fsname_len is not used anywhere, remove it. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
21f55993 |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: merge xfs_showargs into xfs_fs_show_options No need for a trivial wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
1775c506 |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: clean up printing inode32/64 in xfs_showargs inode64 is the only value remaining in the unset array. Special case the inode32/64 options with an explicit seq_printf that prints either inode32 or inode64, and remove the unset array. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
aa58d445 |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: clean up printing the allocsize option in Remove superflous cast. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
7c6b94b1 |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: reverse the polarity of XFS_MOUNT_COMPAT_IOSIZE Replace XFS_MOUNT_COMPAT_IOSIZE with an inverted XFS_MOUNT_LARGEIO flag that makes the usage more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
3274d008 |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: rename the XFS_MOUNT_DFLT_IOSIZE option to Make the flag match the mount option and usage. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
2fcddee8 |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: simplify parsing of allocsize mount option Rework xfs_parseargs to fill out the default value and then parse the option directly into the mount structure, similar to what we do for other updates, and open code the now trivial updates based on on the on-disk superblock directly into xfs_mountfs. Note that this change rejects the allocsize=0 mount option that has been documented as invalid for a long time instead of just ignoring it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
5da8a07c |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: rename the m_writeio_* fields in struct xfs_mount Use the allocsize name to match the mount option and usage instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
3cd1d18b |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: remove the m_readio_* fields in struct xfs_mount m_readio_blocks is entirely unused, and m_readio_blocks is only used in xfs_stat_blksize in a max statements that is a no-op as it always has the same value as m_writeio_log. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
69e8575d |
|
28-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: remove the dsunit and dswidth variables in There is no real need for the local variables here - either they are applied to the mount structure, or if the noalign mount option is set the mount will fail entirely if either is set. Removing them helps cleaning up the mount API conversion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
8da57c5c |
|
28-Oct-2019 |
Ian Kent <raven@themaw.net> |
xfs: remove the biosize mount option It appears the biosize mount option hasn't been documented as a valid option since 2005, remove it. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
598ecfba |
|
17-Oct-2019 |
Christoph Hellwig <hch@lst.de> |
iomap: lift the xfs writeback code to iomap Take the xfs writeback code and move it to fs/iomap. A new structure with three methods is added as the abstraction from the generic writeback code to the file system. These methods are used to map blocks, submit an ioend, and cancel a page that encountered an error before it was added to an ioend. Signed-off-by: Christoph Hellwig <hch@lst.de> [darrick: rename ->submit_ioend to ->prepare_ioend to clarify what it does] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
8ab39f11 |
|
05-Sep-2019 |
Dave Chinner <dchinner@redhat.com> |
xfs: prevent CIL push holdoff in log recovery generic/530 on a machine with enough ram and a non-preemptible kernel can run the AGI processing phase of log recovery enitrely out of cache. This means it never blocks on locks, never waits for IO and runs entirely through the unlinked lists until it either completes or blocks and hangs because it has run out of log space. It runs out of log space because the background CIL push is scheduled but never runs. queue_work() queues the CIL work on the current CPU that is busy, and the workqueue code will not run it on any other CPU. Hence if the unlinked list processing never yields the CPU voluntarily, the push work is delayed indefinitely. This results in the CIL aggregating changes until all the log space is consumed. When the log recoveyr processing evenutally blocks, the CIL flushes but because the last iclog isn't submitted for IO because it isn't full, the CIL flush never completes and nothing ever moves the log head forwards, or indeed inserts anything into the tail of the log, and hence nothing is able to get the log moving again and recovery hangs. There are several problems here, but the two obvious ones from the trace are that: a) log recovery does not yield the CPU for over 4 seconds, b) binding CIL pushes to a single CPU is a really bad idea. This patch addresses just these two aspects of the problem, and are suitable for backporting to work around any issues in older kernels. The more fundamental problem of preventing the CIL from consuming more than 50% of the log without committing will take more invasive and complex work, so will be done as followup work. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
22b13969 |
|
30-Jul-2019 |
Deepa Dinamani <deepa.kernel@gmail.com> |
fs: Fill in max and min timestamps in superblock Fill in the appropriate limits to avoid inconsistencies in the vfs cached inode times when timestamps are outside the permitted range. Even though some filesystems are read-only, fill in the timestamps to reflect the on-disk representation. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-By: Tigran Aivazian <aivazian.tigran@gmail.com> Acked-by: Jeff Layton <jlayton@kernel.org> Cc: aivazian.tigran@gmail.com Cc: al@alarsen.net Cc: coda@cs.cmu.edu Cc: darrick.wong@oracle.com Cc: dushistov@mail.ru Cc: dwmw2@infradead.org Cc: hch@infradead.org Cc: jack@suse.com Cc: jaharkes@cs.cmu.edu Cc: luisbg@kernel.org Cc: nico@fluxnic.net Cc: phillip@squashfs.org.uk Cc: richard@nod.at Cc: salah.triki@gmail.com Cc: shaggy@kernel.org Cc: linux-xfs@vger.kernel.org Cc: codalist@coda.cs.cmu.edu Cc: linux-ext4@vger.kernel.org Cc: linux-mtd@lists.infradead.org Cc: jfs-discussion@lists.sourceforge.net Cc: reiserfs-devel@vger.kernel.org
|
#
250d4b4c |
|
28-Jun-2019 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: remove unused header files There are many, many xfs header files which are included but unneeded (or included twice) in the xfs code, so remove them. nb: xfs_linux.h includes about 9 headers for everyone, so those explicit includes get removed by this. I'm not sure what the preference is, but if we wanted explicit includes everywhere, a followup patch could remove those xfs_*.h includes from xfs_linux.h and move them into the files that need them. Or it could be left as-is. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
adfb5fb4 |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: implement cgroup aware writeback Link every newly allocated writeback bio to cgroup pointed to by the writeback control structure, and charge every byte written back to it. Tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
1058d0f5 |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: move the log ioend workqueue to struct xlog Move the workqueue used for log I/O completions from struct xfs_mount to struct xlog to keep it self contained in the log code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: destroy the log workqueue after ensuring log ios are done] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
ef325959 |
|
05-Jun-2019 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: separate inode geometry Separate the inode geometry information into a distinct structure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
91083269 |
|
01-May-2019 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: change some error-less functions to void types There are several functions which have no opportunity to return an error, and don't contain any ASSERTs which could be argued to be better constructed as error cases. So, make them voids to simplify the callers. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
94079285 |
|
28-Apr-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: don't parse the mtpt mount option The text isn't really any more useful than the default unknown option handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
ed30dcbd |
|
25-Apr-2019 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: rename the speculative block allocation reclaim toggle functions "reclaim" is used throughout the icache code to mean reclamation of incore inode structures. It's also used for two helper functions that toggle background deletion of speculative preallocations. Separate the second of the two uses to make things less confusing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
9fe82b8c |
|
25-Apr-2019 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: track delayed allocation reservations across the filesystem Add a percpu counter to track the number of blocks directly reserved for delayed allocations on the data device. This counter (in contrast to i_delayed_blks) does not track allocated CoW staging extents or anything going on with the realtime device. It will be used in the upcoming summary counter scrub function to check the free block counts without having to freeze the filesystem or walk all the inodes to find the delayed allocations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
28408243 |
|
15-Apr-2019 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: remove unused m_data_workqueue Now that we're no longer using m_data_workqueue, remove it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
72deb455 |
|
05-Apr-2019 |
Christoph Hellwig <hch@lst.de> |
block: remove CONFIG_LBDAF Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has cause various bugs or compiler warnings when actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
66ae56a5 |
|
18-Feb-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: introduce an always_cow mode Add a mode where XFS never overwrites existing blocks in place. This is to aid debugging our COW code, and also put infatructure in place for things like possible future support for zoned block devices, which can't support overwrites. This mode is enabled globally by doing a: echo 1 > /sys/fs/xfs/debug/always_cow Note that the parameter is global to allow running all tests in xfstests easily in this mode, which would not easily be possible with a per-fs sysfs file. In always_cow mode persistent preallocations are disabled, and fallocate will fail when called with a 0 mode (with our without FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE. There are a few interesting xfstests failures when run in always_cow mode: - generic/392 fails because the bytes used in the file used to test hole punch recovery are less after the log replay. This is because the blocks written and then punched out are only freed with a delay due to the logging mechanism. - xfs/170 will fail as the already fragile file streams mechanism doesn't seem to interact well with the COW allocator - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim the file system is badly fragmented, but there is not much we can do to avoid that when always writing out of place - xfs/205 fails because overwriting a file in always_cow mode will require new space allocation and the assumption in the test thus don't work anymore. - xfs/326 fails to modify the file at all in always_cow mode after injecting the refcount error, leading to an unexpected md5sum after the remount, but that again is expected Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
15a268d9 |
|
13-Feb-2019 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: reserve blocks for ifree transaction during log recovery Log recovery frees all the inodes stored in the unlinked list, which can cause expansion of the free inode btree. The ifree code skips block reservations if it thinks there's a per-AG space reservation, but we don't set up the reservation until after log recovery, which means that a finobt expansion blows up in xfs_trans_mod_sb when we exceed the transaction's block reservation. To fix this, we set the "no finobt reservation" flag to true when we create the xfs_mount and only set it to false if we confirm that every AG had enough free space to put aside for the finobt. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
43004b2a |
|
12-Dec-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: add a block to inode count converter Add new helpers to convert units of fs blocks into inodes, and AG blocks into AG inodes, respectively. Convert all the open-coded conversions and XFS_OFFBNO_TO_AGINO(, , 0) calls to use them, as appropriate. The OFFBNO_TO_AGINO macro is retained for xfs_repair. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
bc9f2b7c |
|
12-Dec-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: idiotproof defer op type configuration Recently, we forgot to port a new defer op type to xfsprogs, which caused us some userspace pain. Reorganize the way we make libxfs clients supply defer op type information so that all type information has to be provided at build time instead of risky runtime dynamic configuration. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
dddde68b |
|
18-Oct-2018 |
Adam Borowski <kilobyte@angband.pl> |
xfs: add a define for statfs magic to uapi Needed by userspace programs that call fstatfs(). It'd be natural to publish XFS_SB_MAGIC in uapi, but while these two have identical values, they have different semantic meaning: one is an enum cookie meant for statfs, the other a signature of the on-disk format. Signed-off-by: Adam Borowski <kilobyte@angband.pl> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
4831822f |
|
18-Oct-2018 |
Christoph Hellwig <hch@lst.de> |
xfs: print dangling delalloc extents Instead of just asserting that we have no delalloc space dangling in an inode that gets freed print the actual offenders for debug mode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
3ba738df |
|
17-Jul-2018 |
Christoph Hellwig <hch@lst.de> |
xfs: remove the xfs_ifork_t typedef We only have a few more callers left, so seize the opportunity and kill it off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
1c02d502 |
|
26-Jul-2018 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: remove deprecated barrier/nobarrier mount The barrier mount options have been no-ops and deprecated since 4cf4573 xfs: deprecate barrier/nobarrier mount option i.e. kernel 4.10 / December 2016, with a stated deprecation schedule after v4.15. Should be fair game to remove them now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
82cb1417 |
|
11-Jul-2018 |
Christoph Hellwig <hch@lst.de> |
xfs: add support for sub-pagesize writeback without buffer_heads Switch to using the iomap_page structure for checking sub-page uptodate status and track sub-page I/O completion status, and remove large quantities of boilerplate code working around buffer heads. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
9bb54cb5 |
|
07-Jun-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: clean up MIN/MAX Get rid of the MIN/MAX macros and just use the native min/max macros directly in the XFS code. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
0b61f8a4 |
|
05-Jun-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert to SPDX license tags Remove the verbose license text from XFS files and replace them with SPDX tags. This does not change the license of any of the code, merely refers to the common, up-to-date license files in LICENSES/ This change was mostly scripted. fs/xfs/Makefile and fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected and modified by the following command: for f in `git grep -l "GNU General" fs/xfs/` ; do echo $f cat $f | awk -f hdr.awk > $f.new mv -f $f.new $f done And the hdr.awk script that did the modification (including detecting the difference between GPL-2.0 and GPL-2.0+ licenses) is as follows: $ cat hdr.awk BEGIN { hdr = 1.0 tag = "GPL-2.0" str = "" } /^ \* This program is free software/ { hdr = 2.0; next } /any later version./ { tag = "GPL-2.0+" next } /^ \*\// { if (hdr > 0.0) { print "// SPDX-License-Identifier: " tag print str print $0 str="" hdr = 0.0 next } print $0 next } /^ \* / { if (hdr > 1.0) next if (hdr > 0.0) { if (str != "") str = str "\n" str = str $0 next } print $0 next } /^ \*/ { if (hdr > 0.0) next print $0 next } // { if (hdr > 0.0) { if (str != "") str = str "\n" str = str $0 next } print $0 } END { } $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
80660f20 |
|
30-May-2018 |
Dave Jiang <dave.jiang@intel.com> |
dax: change bdev_dax_supported() to support boolean returns The function return values are confusing with the way the function is named. We expect a true or false return value but it actually returns 0/-errno. This makes the code very confusing. Changing the return values to return a bool where if DAX is supported then return true and no DAX support returns false. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
ba23cba9 |
|
30-May-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
fs: allow per-device dax status checking for filesystems Change bdev_dax_supported so it takes a bdev parameter. This enables multi-device filesystems like xfs to check that a dax device can work for the particular filesystem. Once that's in place, actually fix all the parts of XFS where we need to be able to distinguish between datadev and rtdev. This patch fixes the problem where we screw up the dax support checking in xfs if the datadev and rtdev have different dax capabilities. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases] Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
|
#
e292d7bc |
|
20-May-2018 |
Kent Overstreet <kent.overstreet@gmail.com> |
xfs: convert to bioset_init()/mempool_init() Convert XFS to embedded bio sets. Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
c9fbd7bb |
|
10-May-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: clear sb->s_fs_info on mount failure We recently had an oops reported on a 4.14 kernel in xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage and so the m_perag_tree lookup walked into lala land. Essentially, the machine was under memory pressure when the mount was being run, xfs_fs_fill_super() failed after allocating the xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and freed the xfs_mount, but the sb->s_fs_info field still pointed to the freed memory. Hence when the superblock shrinker then ran it fell off the bad pointer. With the superblock shrinker problem fixed at teh VFS level, this stale s_fs_info pointer is still a problem - we use it unconditionally in ->put_super when the superblock is being torn down, and hence we can still trip over it after a ->fill_super call failure. Hence we need to clear s_fs_info if xfs-fs_fill_super() fails, and we need to check if it's valid in the places it can potentially be dereferenced after a ->fill_super failure. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
dae5cd81 |
|
10-May-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: add mount delay debug option Similar to log_recovery_delay, this delay occurs between the VFS superblock being initialised and the xfs_mount being fully initialised. It also poisons the per-ag radix tree node so that it can be used for triggering shrinker races during mount such as the following: <run memory pressure workload in background> $ cat dirty-mount.sh #! /bin/bash umount -f /dev/pmem0 mkfs.xfs -f /dev/pmem0 mount /dev/pmem0 /mnt/test rm -f /mnt/test/foo xfs_io -fxc "pwrite 0 4k" -c fsync -c "shutdown" /mnt/test/foo umount /dev/pmem0 # let's crash it now! echo 30 > /sys/fs/xfs/debug/mount_delay mount /dev/pmem0 /mnt/test echo 0 > /sys/fs/xfs/debug/mount_delay umount /dev/pmem0 $ sudo ./dirty-mount.sh ..... [ 60.378118] CPU: 3 PID: 3577 Comm: fs_mark Tainted: G D W 4.16.0-rc5-dgc #440 [ 60.378120] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 60.378124] RIP: 0010:radix_tree_next_chunk+0x76/0x320 [ 60.378127] RSP: 0018:ffffc9000276f4f8 EFLAGS: 00010282 [ 60.383670] RAX: a5a5a5a5a5a5a5a4 RBX: 0000000000000010 RCX: 000000000000001a [ 60.385277] RDX: 0000000000000000 RSI: ffffc9000276f540 RDI: 0000000000000000 [ 60.386554] RBP: 0000000000000000 R08: 0000000000000000 R09: a5a5a5a5a5a5a5a5 [ 60.388194] R10: 0000000000000006 R11: 0000000000000001 R12: ffffc9000276f598 [ 60.389288] R13: 0000000000000040 R14: 0000000000000228 R15: ffff880816cd6458 [ 60.390827] FS: 00007f5c124b9740(0000) GS:ffff88083fc00000(0000) knlGS:0000000000000000 [ 60.392253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 60.393423] CR2: 00007f5c11bba0b8 CR3: 000000035580e001 CR4: 00000000000606e0 [ 60.394519] Call Trace: [ 60.395252] radix_tree_gang_lookup_tag+0xc4/0x130 [ 60.395948] xfs_perag_get_tag+0x37/0xf0 [ 60.396522] xfs_reclaim_inodes_count+0x32/0x40 [ 60.397178] xfs_fs_nr_cached_objects+0x11/0x20 [ 60.397837] super_cache_count+0x35/0xc0 [ 60.399159] shrink_slab.part.66+0xb1/0x370 [ 60.400194] shrink_node+0x7e/0x1a0 [ 60.401058] try_to_free_pages+0x199/0x470 [ 60.402081] __alloc_pages_slowpath+0x3a1/0xd20 [ 60.403729] __alloc_pages_nodemask+0x1c3/0x200 [ 60.404941] cache_grow_begin+0x20b/0x2e0 [ 60.406164] fallback_alloc+0x160/0x200 [ 60.407088] kmem_cache_alloc+0x111/0x4e0 [ 60.408038] ? xfs_buf_rele+0x61/0x430 [ 60.408925] kmem_zone_alloc+0x61/0xe0 [ 60.409965] xfs_inode_alloc+0x24/0x1d0 ..... Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
d6b636eb |
|
09-May-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: halt auto-reclamation activities while rebuilding rmap Rebuilding the reverse-mapping tree requires us to quiesce all inodes in the filesystem, so we must stop background reclamation of post-EOF and CoW prealloc blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
e6631f85 |
|
09-May-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: get rid of the log item descriptor It's just a connector between a transaction and a log item. There's a 1:1 relationship between a log item descriptor and a log item, and a 1:1 relationship between a log item descriptor and a transaction. Both relationships are created and terminated at the same time, so why do we even have the descriptor? Replace it with a specific list_head in the log item and a new log item dirtied flag to replace the XFS_LID_DIRTY flag. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [darrick: fix up deferred agfl intent finish_item use of LID_DIRTY] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
a1f69417 |
|
06-Apr-2018 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: non-scrub - remove unused function parameters Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
72c44e35 |
|
23-Mar-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: clean up xfs_mount allocation and dynamic initializers Most of the generic data structures embedded in xfs_mount are dynamically initialized immediately after mp is allocated. A few fields are left out and initialized during the xfs_mountfs() sequence, after mp has been attached to the superblock. To clean this up and help prevent premature access of associated fields, refactor xfs_mount allocation and all dependent init calls into a new helper. This self-documents that all low level data structures (i.e., locks, trees, etc.) should be initialized before xfs_mount is attached to the superblock. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
6231848c |
|
06-Mar-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: check for cow blocks before trying to clear them There's no point in allocating a transaction and locking the inode in preparation to clear cow blocks if there actually are any cow fork extents. Therefore, move the xfs_reflink_cancel_cow_range hunk to xfs_inactive and check the cow ifp first. This makes inode reclamation run faster. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
c3b1b131 |
|
06-Mar-2018 |
Christoph Hellwig <hch@lst.de> |
xfs: implement the lazytime mount option Use the VFS dirty inode tracking for lazytime inodes only, and just log them in ->dirty_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
5b4c845e |
|
25-Feb-2018 |
Chengguang Xu <cgxu519@icloud.com> |
xfs: fix potential memory leak in mount option parsing When specifying string type mount option (e.g., logdev) several times in a mount, current option parsing may cause memory leak. Hence, call kfree for previous one in this case. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
76883f79 |
|
31-Jan-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: remove experimental tag for reverse mapping Reverse mapping has had a while to soak, so remove the experimental tag. Now that we've landed space metadata cross-referencing in scrub, the feature actually has a purpose. Reject rmap filesystems with an rt device until the code to support it is actually implemented. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
|
#
c14632dd |
|
31-Jan-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: don't allow reflink + realtime filesystems We don't support realtime filesystems with reflink either, so fail those mounts. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
|
#
b6e03c10 |
|
31-Jan-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: don't allow DAX on reflink filesystems Now that reflink is no longer experimental, reject attempts to mount with DAX until that whole mess gets sorted out. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
1e369b0e |
|
08-Jan-2018 |
Christoph Hellwig <hch@lst.de> |
xfs: remove experimental tag for reflinks But reject reflink + DAX file systems for now until the code to support reflinks on DAX is actually implemented. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: port to 4.16] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
a0158315 |
|
08-Jan-2018 |
Richard Wareing <rwareing@fb.com> |
xfs: Show realtime device stats on statfs calls if realtime flags set - Reports realtime device free blocks in statfs calls if (realtime) inheritance bit is set on the inode of directory, or realtime flag in the case of files. This is a bit more intuitive, especially for use-cases which are using a much larger device for the realtime device. - Add XFS_IS_REALTIME_MOUNT option to gate based on the existence of a realtime device on the mount, similar to the XFS_IS_REALTIME_INODE option. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Richard Wareing <rwareing@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
10ddf64e |
|
14-Dec-2017 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: remove leftover CoW reservations when remounting ro When we're remounting the filesystem readonly, remove all CoW preallocations prior to going ro. If the fs goes down after the ro remount, we never clean up the staging extents, which means xfs_check will trip over them on a subsequent run. Practically speaking, the next mount will clean them up too, so this is unlikely to be seen. Since we shut down the cowblocks cleaner on remount-ro, we also have to make sure we start it back up if/when we remount-rw. Found by adding clonerange to fsstress and running xfs/017. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
1751e8a6 |
|
27-Nov-2017 |
Linus Torvalds <torvalds@linux-foundation.org> |
Rename superblock flags (MS_xyz -> SB_xyz) This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
357fdad0 |
|
18-Oct-2017 |
Matthew Garrett <mjg59@google.com> |
Convert fs/*/* to SB_I_VERSION [AV: in addition to the fix in previous commit] Signed-off-by: Matthew Garrett <mjg59@google.com> Cc: David Howells <dhowells@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
1e6fa688 |
|
18-Sep-2017 |
Kenjiro Nakayama <nakayamakenjiro@gmail.com> |
xfs: Output warning message when discard option was enabled even though the device does not support discard In order to using discard function, it is necessary that not only xfs is mounted with discard option, but also the discard function is supported by the device. Current code doesn't output any message when users mount with discard option on unsupported device, so it is difficult to notice that it was not enabled actually. This patch adds the warning message to notice that discard option is not enabled due to unsupported device when the filesystem is mounted. Changes in v2 (Suggested by Brian Foster): - Move the unsupported device check into xfs_fs_fill_super(). - Clear the discard flag when device is unsupported. Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
6c370590 |
|
02-Sep-2017 |
Pan Bian <bianpan2016@163.com> |
xfs: use kmem_free to free return value of kmem_zalloc In function xfs_test_remount_options(), kfree() is used to free memory allocated by kmem_zalloc(). But it is better to use kmem_free(). Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
486aff5e |
|
24-Aug-2017 |
Dan Williams <dan.j.williams@intel.com> |
xfs: perform dax_device lookup at mount The ->iomap_begin() operation is a hot path, so cache the fs_dax_get_by_host() result at mount time to avoid the incurring the hash lookup overhead on a per-i/o basis. Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
bc98a42c |
|
17-Jul-2017 |
David Howells <dhowells@redhat.com> |
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) Firstly by applying the following with coccinelle's spatch: @@ expression SB; @@ -SB->s_flags & MS_RDONLY +sb_rdonly(SB) to effect the conversion to sb_rdonly(sb), then by applying: @@ expression A, SB; @@ ( -(!sb_rdonly(SB)) && A +!sb_rdonly(SB) && A | -A != (sb_rdonly(SB)) +A != sb_rdonly(SB) | -A == (sb_rdonly(SB)) +A == sb_rdonly(SB) | -!(sb_rdonly(SB)) +!sb_rdonly(SB) | -A && (sb_rdonly(SB)) +A && sb_rdonly(SB) | -A || (sb_rdonly(SB)) +A || sb_rdonly(SB) | -(sb_rdonly(SB)) != A +sb_rdonly(SB) != A | -(sb_rdonly(SB)) == A +sb_rdonly(SB) == A | -(sb_rdonly(SB)) && A +sb_rdonly(SB) && A | -(sb_rdonly(SB)) || A +sb_rdonly(SB) || A ) @@ expression A, B, SB; @@ ( -(sb_rdonly(SB)) ? 1 : 0 +sb_rdonly(SB) | -(sb_rdonly(SB)) ? A : B +sb_rdonly(SB) ? A : B ) to remove left over excess bracketage and finally by applying: @@ expression A, SB; @@ ( -(A & MS_RDONLY) != sb_rdonly(SB) +(bool)(A & MS_RDONLY) != sb_rdonly(SB) | -(A & MS_RDONLY) == sb_rdonly(SB) +(bool)(A & MS_RDONLY) == sb_rdonly(SB) ) to make comparisons against the result of sb_rdonly() (which is a bool) work correctly. Signed-off-by: David Howells <dhowells@redhat.com>
|
#
c8ce540d |
|
16-Jun-2017 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: remove double-underscore integer types This is a purely mechanical patch that removes the private __{u,}int{8,16,32,64}_t typedefs in favor of using the system {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform the transformation and fix the resulting whitespace and indentation errors: s/typedef\t__uint8_t/typedef __uint8_t\t/g s/typedef\t__uint/typedef __uint/g s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g s/__uint8_t\t/__uint8_t\t\t/g s/__uint/uint/g s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g s/__int/int/g /^typedef.*int[0-9]*_t;$/d Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
011067b0 |
|
17-Jun-2017 |
NeilBrown <neilb@suse.com> |
blk: replace bioset_create_nobvec() with a flags arg to bioset_create() "flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
#
ef510424 |
|
08-May-2017 |
Dan Williams <dan.j.williams@intel.com> |
block, dax: move "select DAX" from BLOCK to FS_DAX For configurations that do not enable DAX filesystems or drivers, do not require the DAX core to be built. Given that the 'direct_access' method has been removed from 'block_device_operations', we can also go ahead and remove the block-related dax helper functions from fs/block_dev.c to drivers/dax/super.c. This keeps dax details out of the block layer and lets the DAX core be built as a module in the FS_DAX=n case. Filesystems need to include dax.h to call bdev_dax_supported(). Cc: linux-xfs@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
#
696a5620 |
|
28-Mar-2017 |
Brian Foster <bfoster@redhat.com> |
xfs: use dedicated log worker wq to avoid deadlock with cil wq The log covering background task used to be part of the xfssyncd workqueue. That workqueue was removed as of commit 5889608df ("xfs: syncd workqueue is no more") and the associated work item scheduled to the xfs-log wq. The latter is used for log buffer I/O completion. Since xfs_log_worker() can invoke a log flush, a deadlock is possible between the xfs-log and xfs-cil workqueues. Consider the following codepath from xfs_log_worker(): xfs_log_worker() xfs_log_force() _xfs_log_force() xlog_cil_force() xlog_cil_force_lsn() xlog_cil_push_now() flush_work() The above is in xfs-log wq context and blocked waiting on the completion of an xfs-cil work item. Concurrently, the cil push in progress can end up blocked here: xlog_cil_push_work() xlog_cil_push() xlog_write() xlog_state_get_iclog_space() xlog_wait(&log->l_flush_wait, ...) The above is in xfs-cil context waiting on log buffer I/O completion, which executes in xfs-log wq context. In this scenario both workqueues are deadlocked waiting on eachother. Add a new workqueue specifically for the high level log covering and ail pushing worker, as was the case prior to commit 5889608df. Diagnosed-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
3802a345 |
|
07-Mar-2017 |
Christoph Hellwig <hch@lst.de> |
xfs: only reclaim unwritten COW extents periodically We only want to reclaim preallocations from our periodic work item. Currently this is archived by looking for a dirty inode, but that check is rather fragile. Instead add a flag to xfs_reflink_cancel_cow_* so that the caller can ask for just cancelling unwritten extents in the COW fork. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix typos in commit message] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
4560e78f |
|
07-Feb-2017 |
Christoph Hellwig <hch@lst.de> |
xfs: don't block the log commit handler for discards Instead we submit the discard requests and use another workqueue to release the extents from the extent busy list. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
4cf4573d |
|
08-Dec-2016 |
Dave Chinner <dchinner@redhat.com> |
xfs: deprecate barrier/nobarrier mount option We always perform integrity operations now, so these mount options don't do anything. Deprecate them and mark them for removal in in a year. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
65523218 |
|
29-Nov-2016 |
Christoph Hellwig <hch@lst.de> |
xfs: remove i_iolock and use i_rwsem in the VFS inode instead This patch drops the XFS-own i_iolock and uses the VFS i_rwsem which recently replaced i_mutex instead. This means we only have to take one lock instead of two in many fast path operations, and we can also shrink the xfs_inode structure. Thanks to the xfs_ilock family there is very little churn, the only thing of note is that we need to switch to use the lock_two_directory helper for taking the i_rwsem on two inodes in a few places to make sure our lock order matches the one used in the VFS. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Jens Axboe <axboe@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
63646fc5 |
|
09-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: check inode reflink flag before calling reflink functions There are a couple of places where we don't check the inode's reflink flag before calling into the reflink code. Fix those, and add some asserts so we don't make this mistake again. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
e54b5bf9 |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: recognize the reflink feature bit Add the reflink feature flag to the set of recognized feature flags. This enables users to write to reflink filesystems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
83104d44 |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: garbage collect old cowextsz reservations Trim CoW reservations made on behalf of a cowextsz hint if they get too old or we run low on quota, so long as we don't have dirty data awaiting writeback or directio operations in progress. Garbage collection of the cowextsize extents are kept separate from prealloc extent reaping because setting the CoW prealloc lifetime to a (much) higher value than the regular prealloc extent lifetime has been useful for combatting CoW fragmentation on VM hosts where the VMs experience bursty write behaviors and we can keep the utilization ratios low enough that we don't start to run out of space. IOWs, it benefits us to keep the CoW fork reservations around for as long as we can unless we run out of blocks or hit inode reclaim. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
84d69619 |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: preallocate blocks for worst-case btree expansion To gracefully handle the situation where a CoW operation turns a single refcount extent into a lot of tiny ones and then run out of space when a tree split has to happen, use the per-AG reserved block pool to pre-allocate all the space we'll ever need for a maximal btree. For a 4K block size, this only costs an overhead of 0.3% of available disk space. When reflink is enabled, we have an unfortunate problem with rmap -- since we can share a block billions of times, this means that the reverse mapping btree can expand basically infinitely. When an AG is so full that there are no free blocks with which to expand the rmapbt, the filesystem will shut down hard. This is rather annoying to the user, so use the AG reservation code to reserve a "reasonable" amount of space for rmap. We'll prevent reflinks and CoW operations if we think we're getting close to exhausting an AG's free space rather than shutting down, but this permanent reservation should be enough for "most" users. Hopefully. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch@lst.de: ensure that we invalidate the freed btree buffer] Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
174edb0e |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: store in-progress CoW allocations in the refcount btree Due to the way the CoW algorithm in XFS works, there's an interval during which blocks allocated to handle a CoW can be lost -- if the FS goes down after the blocks are allocated but before the block remapping takes place. This is exacerbated by the cowextsz hint -- allocated reservations can sit around for a while, waiting to get used. Since the refcount btree doesn't normally store records with refcount of 1, we can use it to record these in-progress extents. In-progress blocks cannot be shared because they're not user-visible, so there shouldn't be any conflicts with other programs. This is a better solution than holding EFIs during writeback because (a) EFIs can't be relogged currently, (b) even if they could, EFIs are bound by available log space, which puts an unnecessary upper bound on how much CoW we can have in flight, and (c) we already have a mechanism to track blocks. At mount time, read the refcount records and free anything we find with a refcount of 1 because those were in-progress when the FS went down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
5e7e605c |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: cancel pending CoW reservations when destroying inodes When destroying the inode, cancel all pending reservations in the CoW fork so that all the reserved blocks go back to the free pile. In theory this sort of cleanup is only needed to clean up after write errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
17c12bcd |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: when replaying bmap operations, don't let unlinked inodes get reaped Log recovery will iget an inode to replay BUI items and iput the inode when it's done. Unfortunately, if the inode was unlinked, the iput will see that i_nlink == 0 and decide to truncate & free the inode, which prevents us from replaying subsequent BUIs. We can't skip the BUIs because we have to replay all the redo items to ensure that atomic operations complete. Since unlinked inode recovery will reap the inode anyway, we can safely introduce a new inode flag to indicate that an inode is in this 'unlinked recovery' state and should not be auto-reaped in the drop_inode path. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
9f3afb57 |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: implement deferred bmbt map/unmap operations Implement deferred versions of the inode block map/unmap functions. These will be used in subsequent patches to make reflink operations atomic. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
6413a014 |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: create bmbt update intent log items Create bmbt update intent/done log items to record redo information in the log. Because we roll transactions multiple times for reflink operations, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
33ba6129 |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: connect refcount adjust functions to upper layers Plumb in the upper level interface to schedule and finish deferred refcount operations via the deferred ops mechanism. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
baf4bcac |
|
03-Oct-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: create refcount update intent log items Create refcount update intent/done log items to record redo information in the log. Because we need to roll transactions between updating the bmbt mapping and updating the reverse mapping, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions, just in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
ddeb14f4 |
|
25-Sep-2016 |
Dave Chinner <dchinner@redhat.com> |
xfs: quiesce the filesystem after recovery on readonly mount Recently we've had a number of reports where log recovery on a v5 filesystem has reported corruptions that looked to be caused by recovery being re-run over the top of an already-recovered metadata. This has uncovered a bug in recovery (fixed elsewhere) but the vector that caused this was largely unknown. A kdump test started tripping over this problem - the system would be crashed, the kdump kernel and environment would boot and dump the kernel core image, and then the system would reboot. After reboot, the root filesystem was triggering log recovery and corruptions were being detected. The metadumps indicated the above log recovery issue. What is happening is that the kdump kernel and environment is mounting the root device read-only to find the binaries needed to do it's work. The result of this is that it is running log recovery. However, because there were unlinked files and EFIs to be processed by recovery, the completion of phase 1 of log recovery could not mark the log clean. And because it's a read-only mount, the unmount process does not write records to the log to mark it clean, either. Hence on the next mount of the filesystem, log recovery was run again across all the metadata that had already been recovered and this is what triggered corruption warnings. To avoid this problem, we need to ensure that a read-only mount always updates the log when it completes the second phase of recovery. We already handle this sort of issue with rw->ro remount transitions, so the solution is as simple as quiescing the filesystem at the appropriate time during the mount process. This results in the log being marked clean so the mount behaviour recorded in the logs on repeated RO mounts will change (i.e. log recovery will no longer be run on every mount until a RW mount is done). This is a user visible change in behaviour, but it is harmless. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
cd00158c |
|
18-Sep-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: convert RUI log formats to use variable length arrays Use variable length array declarations for RUI log items, and replace the open coded sizeof formulae with a single function. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
738f57c1 |
|
25-Aug-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: disallow mounting of realtime + rmap filesystems Since the kernel doesn't currently support the realtime rmapbt, don't allow such filesystems to be mounted. Support will appear in a future release. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
722e2517 |
|
02-Aug-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: remove the extents array from the rmap update done log item Nothing ever uses the extent array in the rmap update done redo item, so remove it before it is fixed in the on-disk log format. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
1c0607ac |
|
02-Aug-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: enable the rmap btree functionality Originally-From: Dave Chinner <dchinner@redhat.com> Add the feature flag to the supported matrix so that the kernel can mount and use rmap btree enabled filesystems Signed-off-by: Dave Chinner <dchinner@redhat.com> [darrick.wong@oracle.com: move the experimental tag] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
f8dbebef |
|
02-Aug-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: enable the xfs_defer mechanism to process rmaps to update Connect the xfs_defer mechanism with the pieces that we'll need to handle deferred rmap updates. We'll wire up the existing code to our new deferred mechanism later. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
5880f2d7 |
|
02-Aug-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: create rmap update intent log items Create rmap update intent/done log items to record redo information in the log. Because we need to roll transactions between updating the bmbt mapping and updating the reverse mapping, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions, just in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
52548852 |
|
02-Aug-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: rmap btree requires more reserved free space Originally-From: Dave Chinner <dchinner@redhat.com> The rmap btree is allocated from the AGFL, which means we have to ensure ENOSPC is reported to userspace before we run out of free space in each AG. The last allocation in an AG can cause a full height rmap btree split, and that means we have to reserve at least this many blocks *in each AG* to be placed on the AGFL at ENOSPC. Update the various space calculation functions to handle this. Also, because the macros are now executing conditional code and are called quite frequently, convert them to functions that initialise variables in the struct xfs_mount, use the new variables everywhere and document the calculations better. [darrick.wong@oracle.com: don't reserve blocks if !rmap] [dchinner@redhat.com: update m_ag_max_usable after growfs] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
310a75a3 |
|
02-Aug-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: change xfs_bmap_{finish,cancel,init,free} -> xfs_defer_* Drop the compatibility shims that we were using to integrate the new deferred operation mechanism into the existing code. No new code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
9749fee8 |
|
02-Aug-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: enable the xfs_defer mechanism to process extents to free Connect the xfs_defer mechanism with the pieces that we'll need to handle deferred extent freeing. We'll wire up the existing code to our new deferred mechanism later. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
72ccbbe1 |
|
21-Jul-2016 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove EXPERIMENTAL tag from sparse inode feature Been around for long enough now, hasn't caused any regression test failures in the past 3 months, so it's time to make it a fully supported feature. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
e66a4c67 |
|
20-Jun-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: convert list of extents to free into a regular list In struct xfs_bmap_free, convert the open-coded free extent list to a regular list, then use list_sort to sort it prior to processing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
fa5a4f57 |
|
20-Jun-2016 |
Brian Foster <bfoster@redhat.com> |
xfs: cancel eofblocks background trimming on remount read-only The filesystem quiesce sequence performs the operations necessary to drain all background work, push pending transactions through the log infrastructure and wait on I/O resulting from the final AIL push. We have had reports of remount,ro hangs in xfs_log_quiesce() -> xfs_wait_buftarg(), however, and some instrumentation code to detect transaction commits at this point in the quiesce sequence has inculpated the eofblocks background scanner as a cause. While higher level remount code generally prevents user modifications by the time the filesystem has made it to xfs_log_quiesce(), the background scanner may still be alive and can perform pending work at any time. If this occurs between the xfs_log_force() and xfs_wait_buftarg() calls within xfs_log_quiesce(), this can lead to an indefinite lockup in xfs_wait_buftarg(). To prevent this problem, cancel the background eofblocks scan worker during the remount read-only quiesce sequence. This suspends background trimming when a filesystem is remounted read-only. This is only done in the remount path because the freeze codepath has already locked out new transactions by the time the filesystem attempts to quiesce (and thus waiting on an active work item could deadlock). Kick the eofblocks worker to pick up where it left off once an fs is remounted back to read-write. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
0d5a75e9 |
|
01-Jun-2016 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: make several functions static Al Viro noticed that xfs_lock_inodes should be static, and that led to ... a few more. These are just the easy ones, others require moving functions higher in source files, so that's not done here to keep this review simple. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
8179c036 |
|
17-May-2016 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove xfs_fs_evict_inode() Joe Lawrence reported a list_add corruption with 4.6-rc1 when testing some custom md administration code that made it's own block device nodes for the md array. The simple test loop of: for i in {0..100}; do mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR mdadm --detail --export $tmp/tmp_node > /dev/null rm -f $tmp/tmp_node done Would produce this warning in bd_acquire() when mdadm opened the device node: list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8. And then produce this from bd_forget from kdevtmpfs evicting a block dev inode: list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8 This is a regression caused by commit c19b3b05 ("xfs: mode di_mode to vfs inode"). The issue is that xfs_inactive() frees the unlinked inode, and the above commit meant that this freeing zeroed the mode in the struct inode. The problem is that after evict() has called ->evict_inode, it expects the i_mode to be intact so that it can call bd_forget() or cd_forget() to drop the reference to the block device inode attached to the XFS inode. In reality, the only thing we do in xfs_fs_evict_inode() that is not generic is call xfs_inactive(). We can move the xfs_inactive() call to xfs_fs_destroy_inode() without any problems at all, and this will leave the VFS inode intact until it is completely done with it. So, remove xfs_fs_evict_inode(), and do the work it used to do in ->destroy_inode instead. cc: <stable@vger.kernel.org> # 4.6 Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
1e937cdd |
|
10-May-2016 |
Toshi Kani <toshi.kani@hpe.com> |
xfs: Add alignment check for DAX mount When a partition is not aligned by 4KB, mount -o dax succeeds, but any read/write access to the filesystem fails, except for metadata update. Call bdev_dax_supported() to perform proper precondition checks which includes this partition alignment check. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
|
#
0e51a8e1 |
|
05-Apr-2016 |
Christoph Hellwig <hch@lst.de> |
xfs: optimize bio handling in the buffer writeback path This patch implements two closely related changes: First it embeds a bio the ioend structure so that we don't have to allocate one separately. Second it uses the block layer bio chaining mechanism to chain additional bios off this first one if needed instead of manually accounting for multiple bio completions in the ioend structure. Together this removes a memory allocation per ioend and greatly simplifies the ioend setup and I/O completion path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
ce5c767d |
|
05-Apr-2016 |
Eryu Guan <guaneryu@gmail.com> |
xfs: add missing break in xfs_parseargs() Commit 2e74af0e1189 ("xfs: convert mount option parsing to tokens") missed a 'break;' in xfs_parseargs() which causes mount to fail with "-o pqnoenforce" option when mounting a v4 filesystem. xfs/050 catches this failure: XFS (vda6): Super block does not support project and group quota together Fixes: 2e74af0e1189 ("xfs: convert mount option parsing to tokens") Signed-off-by: Eryu Guan <guaneryu@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
d0a58e83 |
|
05-Apr-2016 |
Eric Sandeen <sandeen@redhat.com> |
xfs: disallow rw remount on fs with unknown ro-compat features Today, a kernel which refuses to mount a filesystem read-write due to unknown ro-compat features can still transition to read-write via the remount path. The old kernel is most likely none the wiser, because it's unaware of the new feature, and isn't using it. However, writing to the filesystem may well corrupt metadata related to that new feature, and moving to a newer kernel which understand the feature will have problems. Right now the only ro-compat feature we have is the free inode btree, which showed up in v3.16. It would be good to push this back to all the active stable kernels, I think, so that if anyone is using newer mkfs (which enables the finobt feature) with older kernel releases, they'll be protected. Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
ea1754a0 |
|
01-Apr-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
09cbfeaf |
|
01-Apr-2016 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
30cbc591 |
|
08-Mar-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: check sizes of XFS on-disk structures at compile time Check the sizes of XFS on-disk structures when compiling the kernel. Use this to catch inadvertent changes in structure size due to padding and alignment issues, etc. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
12c3f05c |
|
01-Mar-2016 |
Eric Sandeen <sandeen@redhat.com> |
xfs: fix up inode32/64 (re)mount handling inode32/inode64 allocator behavior with respect to mount, remount and growfs is a little tricky. The inode32 mount option should only enable the inode32 allocator heuristics if the filesystem is large enough for 64-bit inodes to exist. Today, it has this behavior on the initial mount, but a remount with inode32 unconditionally changes the allocation heuristics, even for a small fs. Also, an inode32 mounted small filesystem should transition to the inode32 allocator if the filesystem is subsequently grown to a sufficient size. Today that does not happen. This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a single new function, and moves the "is the maximum inode number big enough to matter" test into that function, so it doesn't rely on the caller to get it right - which remount did not do, previously. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
a08ee40a |
|
01-Mar-2016 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: sanitize remount options Perform basic sanitization of remount options by passing the option string and a dummy mount structure through xfs_parseargs and returning the result. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
2e74af0e |
|
01-Mar-2016 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: convert mount option parsing to tokens This should be a no-op change, just switch to token parsing like every other respectable filesystem does. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
5d097056 |
|
14-Jan-2016 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
kmemcg: account certain kmem allocations to memcg Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a841b64d |
|
03-Jan-2016 |
Markus Elfring <elfring@users.sourceforge.net> |
XFS: Use a signed return type for suffix_kstrtoint() The return type "unsigned long" was used by the suffix_kstrtoint() function even though it will eventually return a negative error code. Improve this implementation detail by using the type "int" instead. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
7a29ac47 |
|
09-Nov-2015 |
Chris Mason <clm@fb.com> |
xfs: give all workqueues rescuer threads We're consistently hitting deadlocks here with XFS on recent kernels. After some digging through the crash files, it looks like everyone in the system is waiting for XFS to reclaim memory. Something like this: PID: 2733434 TASK: ffff8808cd242800 CPU: 19 COMMAND: "java" #0 [ffff880019c53588] __schedule at ffffffff818c4df2 #1 [ffff880019c535d8] schedule at ffffffff818c5517 #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348 #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453 #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7 #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433 #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9 #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73 #10 [ffff880019c539c8] shrink_slab at ffffffff811460e6 #11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f #12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba #13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a #14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8 #15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671 #16 [ffff880019c53dd8] copy_process at ffffffff8104f781 #17 [ffff880019c53ec8] do_fork at ffffffff8105129c #18 [ffff880019c53f38] sys_clone at ffffffff810515b6 #19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting for IO, which is waiting for workers to complete the IO which is waiting for worker threads that don't exist yet: PID: 2752451 TASK: ffff880bd6bdda00 CPU: 37 COMMAND: "kworker/37:1" #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2 #1 [ffff8808d20abc00] schedule at ffffffff818c5517 #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495 #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82 #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f #6 [ffff8808d20abe40] worker_thread at ffffffff810699be #7 [ffff8808d20abec0] kthread at ffffffff8106ef59 #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8 I think we should be using WQ_MEM_RECLAIM to make sure this thread pool makes progress when we're not able to allocate new workers. [dchinner: make all workqueues WQ_MEM_RECLAIM] Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
af3b6382 |
|
02-Nov-2015 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: don't leak uuid table on rmmod Don't leak the UUID table when the module is unloaded. (Found with kmemleak.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
f9d460b3 |
|
18-Oct-2015 |
Dan Carpenter <dan.carpenter@oracle.com> |
xfs: fix an error code in xfs_fs_fill_super() If alloc_percpu() fails, we accidentally return PTR_ERR(NULL), which means success, but we intended to return -ENOMEM. Fixes: 225e4635580c ('xfs: per-filesystem stats in sysfs') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
ff6d6af2 |
|
12-Oct-2015 |
Bill O'Donnell <billodo@redhat.com> |
xfs: per-filesystem stats counter implementation This patch modifies the stats counting macros and the callers to those macros to properly increment, decrement, and add-to the xfs stats counts. The counts for global and per-fs stats are correctly advanced, and cleared by writing a "1" to the corresponding clear file. global counts: /sys/fs/xfs/stats/stats per-fs counts: /sys/fs/xfs/sda*/stats/stats global clear: /sys/fs/xfs/stats/stats_clear per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around stats structures and macros. ] Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
225e4635 |
|
12-Oct-2015 |
Bill O'Donnell <billodo@redhat.com> |
xfs: per-filesystem stats in sysfs This patch implements per-filesystem stats objects in sysfs. It depends on the application of the previous patch series that develops the infrastructure to support both xfs global stats and xfs per-fs stats in sysfs. Stats objects are instantiated when an xfs filesystem is mounted and deleted on unmount. With this patch, the stats directory is created and populated with the familiar stats and stats_clear files. Example: /sys/fs/xfs/sda9/stats/stats /sys/fs/xfs/sda9/stats/stats_clear With this patch, the individual counts within the new per-fs stats file(s) remain at zero. Functions that use the the macros to increment, decrement, and add-to the per-fs stats counts will be covered in a separate new patch to follow this one. Note that the counts within the global stats file (/sys/fs/xfs/stats/stats) advance normally and can be cleared as it was prior to this patch. [dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so it is down before/after any path that uses the per-mount stats. ] Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
80529c45 |
|
11-Oct-2015 |
Bill O'Donnell <billodo@redhat.com> |
xfs: pass xfsstats structures to handlers and macros This patch is the next step toward per-fs xfs stats. The patch makes the show and clear routines able to handle any stats structure associated with a kobject. Instead of a single global xfsstats structure, add kobject and a pointer to a per-cpu struct xfsstats. Modify the macros that manipulate the stats accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access xfsstats->xs_stats. The sysfs functions need to get from the kobject back to the xfsstats structure which contains it, and pass the pointer to the ->xs_stats percpu structure into the show & clear routines. Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
bb230c12 |
|
11-Oct-2015 |
Bill O'Donnell <billodo@redhat.com> |
xfs: create global stats and stats_clear in sysfs Currently, xfs global stats are in procfs. This patch introduces (replicates) the global stats in sysfs. Additionally a stats_clear file is introduced in sysfs. Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
a068acf2 |
|
04-Sep-2015 |
Kees Cook <keescook@chromium.org> |
fs: create and use seq_show_option for escaping Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2ccf4a9b |
|
24-Aug-2015 |
Eric Sandeen <sandeen@redhat.com> |
xfs: collapse allocsize and biosize mount option handling The allocsize and biosize mount options are handled identically, other than allocsize accepting suffixes. suffix_kstrtoint handles bare numbers just fine too, so these can be collapsed. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
1b867d3a |
|
18-Aug-2015 |
Brian Foster <bfoster@redhat.com> |
xfs: relocate sparse inode mount warning The sparse inodes feature is currently considered experimental. We warn at mount time from xfs_mount_validate_sb(). This function is part of the superblock verifier codepath, however, which means it could be invoked repeatedly on superblock reads or writes. This is currently only noticeable from userspace, where mkfs produces multiple warnings at format time. As mkfs warnings were not the intent of this change, relocate the mount time warning to xfs_fs_fill_super(), which is only invoked once and only in kernel space. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
cbe4dab1 |
|
03-Jun-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: add initial DAX support Add initial DAX support to XFS. To do this we need a new mount option to turn DAX on filesystem, and we need to propagate this into the inode flags whenever an inode is instantiated so that the per-inode checks throughout the code Do The Right Thing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
2b0143b5 |
|
17-Mar-2015 |
David Howells <dhowells@redhat.com> |
VFS: normal filesystems (and lustre): d_inode() annotations that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
bbe051c8 |
|
12-Apr-2015 |
Eric Sandeen <sandeen@redhat.com> |
xfs: disallow ro->rw remount on norecovery mount There's a bit of a loophole in norecovery mount handling right now: an initial mount must be readonly, but nothing prevents a mount -o remount,rw from producing a writable, unrecovered xfs filesystem. It might be possible to try to perform a log recovery when this is requested, but I'm not sure it's worth the effort. For now, simply disallow this sort of transition. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
5e9383f9 |
|
24-Mar-2015 |
Joe Perches <joe@perches.com> |
xfs: Fix incorrect positive ENOMEM return added a positive error return value. This value filters up through the return layers and should be negative as the other return values are in the same function. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
444a7022 |
|
23-Feb-2015 |
Eric Sandeen <sandeen@redhat.com> |
xfs: remove deprecated mount options We recently removed deprecated sysctls; may as well remove deprecated mount options as well, we've stated that they'd be gone by now in the docs. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
3b9ce795 |
|
23-Feb-2015 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: log unmount events on console There are times, when doing triage and forensics, that we would like to know whether a filesystem was unmounted, or if the plug was pulled without a clean unmount. Log unmounts at the same level (NOTICE) as we log mounts. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
653c60b6 |
|
23-Feb-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce mmap/truncate lock Right now we cannot serialise mmap against truncate or hole punch sanely. ->page_mkwrite is not able to take locks that the read IO path normally takes (i.e. the inode iolock) because that could result in lock inversions (read - iolock - page fault - page_mkwrite - iolock) and so we cannot use an IO path lock to serialise page write faults against truncate operations. Instead, introduce a new lock that is used *only* in the ->page_mkwrite path that is the equivalent of the iolock. The lock ordering in a page fault is i_mmaplock -> page lock -> i_ilock, and so in truncate we can i_iolock -> i_mmaplock and so lock out new write faults during the process of truncation. Because i_mmap_lock is outside the page lock, we can hold it across all the same operations we hold the i_iolock for. The only difference is that we never hold the i_mmaplock in the normal IO path and so do not ever have the possibility that we can page fault inside it. Hence there are no recursion issues on the i_mmap_lock and so we can use it to serialise page fault IO against inode modification operations that affect the IO path. This patch introduces the i_mmaplock infrastructure, lockdep annotations and initialisation/destruction code. Use of the new lock will be in subsequent patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
5681ca40 |
|
23-Feb-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: Remove icsb infrastructure Now that the in-core superblock infrastructure has been replaced with generic per-cpu counters, we don't need it anymore. Nuke it from orbit so we are sure that it won't haunt us again... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
0d485ada |
|
23-Feb-2015 |
Dave Chinner <david@fromorbit.com> |
xfs: use generic percpu counters for free block counter XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. The free block counter is special in that it is used for ENOSPC detection outside transaction contexts for for delayed allocation. This means that the counter needs to be accurate at zero. The current per-cpu counter code jumps through lots of hoops to ensure we never run past zero, but we don't need to make all those jumps with the generic counter implementation. The generic counter implementation allows us to pass a "batch" threshold at which the addition/subtraction to the counter value will be folded back into global value under lock. We can use this feature to reduce the batch size as we approach 0 in a very similar manner to the existing counters and their rebalance algorithm. If we use a batch size of 1 as we approach 0, then every addition and subtraction will be done against the global value and hence allow accurate detection of zero threshold crossing. Hence we can replace the handrolled, accurate-at-zero counters with generic percpu counters. Note: this removes just enough of the icsb infrastructure to compile without warnings. The rest will go in subsequent commits. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
e88b64ea |
|
23-Feb-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: use generic percpu counters for free inode counter XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. The free inode counter is not used for any limit enforcement - the per-AG free inode counters are used during allocation to determine if there are inode available for allocation. Hence we don't need any of the complexity of the hand-rolled counters and we can simply replace them with generic per-cpu counters similar to the inode counter. This version introduces a xfs_mod_ifree() helper function from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
501ab323 |
|
23-Feb-2015 |
Dave Chinner <david@fromorbit.com> |
xfs: use generic percpu counters for inode counter XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. There are some warts around the use of them for the inode counter as the hand rolled counter is designed to be accurate at zero, but has no specific accurracy at any other value. This design causes problems for the maximum inode count threshold enforcement, as there is no trigger that balances the counters as they get close tothe maximum threshold. Instead of designing new triggers for balancing, just replace the handrolled per-cpu counter with a generic counter. This enables us to update the counter through the normal superblock modification funtions, but rather than do that we add a xfs_mod_icount() helper function (from Christoph Hellwig) and keep the percpu counter outside the superblock in the struct xfs_mount. This means we still need to initialise the per-cpu counter specifically when we read the superblock, and vice versa when we log/write it, but it does mean that we don't need to change any other code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
4101b624 |
|
12-Feb-2015 |
Vladimir Davydov <vdavydov.dev@gmail.com> |
fs: consolidate {nr,free}_cached_objects args in shrink_control We are going to make FS shrinkers memcg-aware. To achieve that, we will have to pass the memcg to scan to the nr_cached_objects and free_cached_objects VFS methods, which currently take only the NUMA node to scan. Since the shrink_control structure already holds the node, and the memcg to scan will be added to it when we introduce memcg-aware vmscan, let us consolidate the methods' arguments in this structure to keep things clean. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
01f9882e |
|
05-Feb-2015 |
Eric Sandeen <sandeen@redhat.com> |
xfs: report proper f_files in statfs if we overshoot imaxpct Normally, a statfs syscall reports m_maxicount as f_files (total file nodes in file system) because it is supposed to be the upper limit for dynamically-allocated inodes. It's possible, however, to overshoot imaxpct / m_maxicount. If this happens, we should report the actual number of allocated inodes, which is contained in sb_icount. Add one more adjustment to the statfs code to make this happen. Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
61e63ecb |
|
21-Jan-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: consolidate superblock logging functions We now have several superblock loggin functions that are identical except for the transaction reservation and whether it shoul dbe a synchronous transaction or not. Consolidate these all into a single function, a single reserveration and a sync flag and call it xfs_sync_sb(). Also, xfs_mod_sb() is not really a modification function - it's the operation of logging the superblock buffer. hence change the name of it to reflect this. Note that we have to change the mp->m_update_flags that are passed around at mount time to a boolean simply to indicate a superblock update is needed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
4d11a402 |
|
21-Jan-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove bitfield based superblock updates When we log changes to the superblock, we first have to write them to the on-disk buffer, and then log that. Right now we have a complex bitfield based arrangement to only write the modified field to the buffer before we log it. This used to be necessary as a performance optimisation because we logged the superblock buffer in every extent or inode allocation or freeing, and so performance was extremely important. We haven't done this for years, however, ever since the lazy superblock counters pulled the superblock logging out of the transaction commit fast path. Hence we have a bunch of complexity that is not necessary that makes writing the in-core superblock to disk much more complex than it needs to be. We only need to log the superblock now during management operations (e.g. during mount, unmount or quota control operations) so it is not a performance critical path anymore. As such, remove the complex field based logging mechanism and replace it with a simple conversion function similar to what we use for all other on-disk structures. This means we always log the entirity of the superblock, but again because we rarely modify the superblock this is not an issue for log bandwidth or CPU time. Indeed, if we do log the superblock frequently, delayed logging will minimise the impact of this overhead. [Fixed gquota/pquota inode sharing regression noticed by bfoster.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
77af574e |
|
23-Dec-2014 |
Eric Sandeen <sandeen@redhat.com> |
xfs: remove extra newlines from xfs messages xfs_warn() and friends add a newline by default, but some messages add another one. Particularly for the failing write message below, this can waste a lot of console real estate! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
b29c70f5 |
|
03-Dec-2014 |
Brian Foster <bfoster@redhat.com> |
xfs: split metadata and log buffer completion to separate workqueues XFS traditionally sends all buffer I/O completion work to a single workqueue. This includes metadata buffer completion and log buffer completion. The log buffer completion requires a high priority queue to prevent stalls due to log forces getting stuck behind other queued work. Rather than continue to prioritize all buffer I/O completion due to the needs of log completion, split log buffer completion off to m_log_workqueue and move the high priority flag from m_buf_workqueue to m_log_workqueue. Add a b_ioend_wq wq pointer to xfs_buf to allow completion workqueue customization on a per-buffer basis. Initialize b_ioend_wq to m_buf_workqueue by default in the generic buffer I/O submission path. Finally, override the default wq with the high priority m_log_workqueue in the log buffer I/O submission path. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
cdc9cec7 |
|
03-Dec-2014 |
Dave Chinner <dchinner@redhat.com> |
xfs: active inodes stat is broken vn_active only ever gets decremented, so it has a very large negative number. Make it track the inode count we currently have allocated properly so we can easily track the size of the inode cache via tools like PCP. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
d2a5e3c6 |
|
30-Nov-2014 |
Markus Elfring <elfring@users.sourceforge.net> |
xfs: remove unnecessary null checks The functions xfs_blkdev_put() and xfs_qm_dqrele() test whether their argument is NULL and then return immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
508b6b3b |
|
27-Nov-2014 |
Christoph Hellwig <hch@lst.de> |
xfs: merge xfs_inum.h into xfs_format.h Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
4fb6e8ad |
|
27-Nov-2014 |
Christoph Hellwig <hch@lst.de> |
xfs: merge xfs_ag.h into xfs_format.h More on-disk format consolidation. A few declarations that weren't on-disk format related move into better suitable spots. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
6d3ebaae |
|
27-Nov-2014 |
Christoph Hellwig <hch@lst.de> |
xfs: merge xfs_dinode.h into xfs_format.h More consolidatation for the on-disk format defintions. Note that the XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related to the on disk format, but depends on a CONFIG_ option. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
78c931b8 |
|
27-Nov-2014 |
Brian Foster <bfoster@redhat.com> |
xfs: replace global xfslogd wq with per-mount wq The xfslogd workqueue is a global, single-job workqueue for buffer ioend processing. This means we allow for a single work item at a time for all possible XFS mounts on a system. fsstress testing in loopback XFS over XFS configurations has reproduced xfslogd deadlocks due to the single threaded nature of the queue and dependencies introduced between the separate XFS instances by online discard (-o discard). Discard over a loopback device converts the discard request to a hole punch (fallocate) on the underlying file. Online discard requests are issued synchronously and from xfslogd context in XFS, hence the xfslogd workqueue is blocked in the upper fs waiting on a hole punch request to be servied in the lower fs. If the lower fs issues I/O that depends on xfslogd to complete, both filesystems end up hung indefinitely. This is reproduced reliabily by generic/013 on XFS->loop->XFS test devices with the '-o discard' mount option. Further, docker implementations appear to use this kind of configuration for container instance filesystems by default (container fs->dm-> loop->base fs) and therefore are subject to this deadlock when running on XFS. Replace the global xfslogd workqueue with a per-mount variant. This guarantees each mount access to a single worker and prevents deadlocks due to inter-fs dependencies introduced by discard. Since the queue is only responsible for buffer iodone processing at this point in time, rename xfslogd to xfs-buf. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
17ef4fdd |
|
30-Sep-2014 |
Jan Kara <jack@suse.cz> |
xfs: Set allowed quota types We support user, group, and project quotas. Tell VFS about it. CC: xfs@oss.sgi.com CC: Dave Chinner <david@fromorbit.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
e3aed1a0 |
|
28-Sep-2014 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_kset should be static As it is accessed through the struct xfs_mount and can be set up entirely from fs/xfs/xfs_super.c Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
65b65735 |
|
08-Sep-2014 |
Brian Foster <bfoster@redhat.com> |
xfs: add debug sysfs attribute set Create a top-level debug directory for global debug sysfs attributes. This directory is added and removed on XFS module initialization and removal respectively for DEBUG mode kernels only. It typically resides at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs hierarchy as attributes might define global behavior or behavior that must be configured before an xfs mount is available (e.g., log recovery behavior). Define the global debug kobject that represents the debug sysfs directory and add generic attribute show/store helpers to support future attributes. No debug attributes are exported as of yet. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
8018ec08 |
|
08-Sep-2014 |
Brian Foster <bfoster@redhat.com> |
xfs: mark all internal workqueues as freezable Workqueues must be explicitly set as freezable to ensure they are frozen in the assocated part of the hibernation/suspend sequence. Freezing of workqueues and kernel threads is important to ensure that modifications are not made on-disk after the hibernation image has been created. Otherwise, the in-memory state can become inconsistent with what is on disk and eventually lead to filesystem corruption. We have reports of free space btree corruptions that occur immediately after restore from hibernate that suggest the xfs-eofblocks workqueue could be causing such problems if it races with hibernation. Mark all of the internal XFS workqueues as freezable to ensure nothing changes on-disk once the freezer infrastructure freezes kernel threads and creates the hibernation image. Signed-off-by: Brian Foster <bfoster@redhat.com> Reported-by: Carlos E. R. <carlos.e.r@opensuse.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
d5cf09ba |
|
29-Jul-2014 |
Christoph Hellwig <hch@lst.de> |
xfs: require 64-bit sector_t Trying to support tiny disks only and saving a bit memory might have made sense on an SGI O2 15 years ago, but is pretty pointless today. Remove the rarely tested codepath that uses various smaller in-memory types to reduce our test matrix and make the codebase a little bit smaller and less complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
54aa61f8 |
|
24-Jul-2014 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: tidy up xfs_set_inode32 xfs_set_inode32() caught my eye because it had weird spacing around the "-1's". In cleaning that up, I realized that the assignment in the declaration of "ino" is never used; it's rewritten before it gets read. Drop the ino initializer from its declaration since it's not used, and move the agino initialization into the body of the function, mostly so that we can have pretty whitespace and not exceed 80 columns. :) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
9de67c3b |
|
24-Jul-2014 |
Eric Sandeen <sandeen@redhat.com> |
xfs: allow inode allocations in post-growfs disk space Today, if we perform an xfs_growfs which adds allocation groups, mp->m_maxagi is not properly updated when the growfs is complete. Therefore inodes will continue to be allocated only in the AGs which existed prior to the growfs, and the new space won't be utilized. This is because of this path in xfs_growfs_data_private(): xfs_growfs_data_private xfs_initialize_perag(mp, nagcount, &nagimax); if (mp->m_flags & XFS_MOUNT_32BITINODES) index = xfs_set_inode32(mp); else index = xfs_set_inode64(mp); if (maxagi) *maxagi = index; where xfs_set_inode* iterates over the (old) agcount in mp->m_sb.sb_agblocks, which has not yet been updated in the growfs path. So "index" will be returned based on the old agcount, not the new one, and new AGs are not available for inode allocation. Fix this by explicitly passing the proper AG count (which xfs_initialize_perag() already has) down another level, so that xfs_set_inode* can make the proper decision about acceptable AGs for inode allocation in the potentially newly-added AGs. This has been broken since 3.7, when these two xfs_set_inode* functions were added in commit 2d2194f. Prior to that, we looped over "agcount" not sb_agblocks in these calculations. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
3d871226 |
|
14-Jul-2014 |
Brian Foster <bfoster@redhat.com> |
xfs: add a sysfs kset Create a sysfs kset to contain all sub-objects associated with the XFS module. The kset is created and removed on module initialization and removal respectively. The kset uses fs_obj as a parent. This leads to the creation of a /sys/fs/xfs directory when the kset exists. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
2451337d |
|
24-Jun-2014 |
Dave Chinner <dchinner@redhat.com> |
xfs: global error sign conversion Convert all the errors the core XFs code to negative error signs like the rest of the kernel and remove all the sign conversion we do in the interface layers. Errors for conversion (and comparison) found via searches like: $ git grep " E" fs/xfs $ git grep "return E" fs/xfs $ git grep " E[A-Z].*;$" fs/xfs Negation points found via searches like: $ git grep "= -[a-z,A-Z]" fs/xfs $ git grep "return -[a-z,A-D,F-Z]" fs/xfs $ git grep " -[a-z].*;" fs/xfs [ with some bits I missed from Brian Foster ] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
b474c7ae |
|
21-Jun-2014 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: Nuke XFS_ERROR macro XFS_ERROR was designed long ago to trap return values, but it's not runtime configurable, it's not consistently used, and we can do similar error trapping with ftrace scripts and triggers from userspace. Just nuke XFS_ERROR and associated bits. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
bc147822 |
|
14-May-2014 |
Dave Chinner <dchinner@redhat.com> |
xfs: negate xfs_icsb_init_counters error value Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
45687642 |
|
14-May-2014 |
Dave Chinner <dchinner@redhat.com> |
xfs: negate mount workqueue init error value Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
1919adda |
|
22-Apr-2014 |
Christoph Hellwig <hch@lst.de> |
xfs: don't create a slab cache for filestream items We only have very few of these around, and allocation isn't that much of a hot path. Remove the slab cache to simplify the code, and to not waste any resources for the usual case of not having any inodes that use the filestream allocator. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
34dcefd7 |
|
14-Apr-2014 |
Eric Sandeen <sandeen@redhat.com> |
xfs: remove unused args from xfs_alloc_buftarg() Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
a96c4151 |
|
14-Apr-2014 |
Eric Sandeen <sandeen@redhat.com> |
xfs: remove unused blocksize arg from xfs_setsize_buftarg() Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
91b0abe3 |
|
03-Apr-2014 |
Johannes Weiner <hannes@cmpxchg.org> |
mm + fs: store shadow entries in page cache Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
02b9984d |
|
13-Mar-2014 |
Theodore Ts'o <tytso@mit.edu> |
fs: push sync_filesystem() down to the file system's remount_fs() Previously, the no-op "mount -o mount /dev/xxx" operation when the file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly documented or guaraunteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only. However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara <jack@suse.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Anders Larsen <al@alarsen.net> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: xfs@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: codalist@coda.cs.cmu.edu Cc: linux-ext4@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: fuse-devel@lists.sourceforge.net Cc: cluster-devel@redhat.com Cc: linux-mtd@lists.infradead.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux-nfs@vger.kernel.org Cc: linux-nilfs@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: reiserfs-devel@vger.kernel.org
|
#
0dc83bd3 |
|
21-Feb-2014 |
Jan Kara <jack@suse.cz> |
Revert "writeback: do not sync data dirtied after sync start" This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave Chinner <david@fromorbit.com> has reported the commit may cause some inodes to be left out from sync(2). This is because we can call redirty_tail() for some inode (which sets i_dirtied_when to current time) after sync(2) has started or similarly requeue_inode() can set i_dirtied_when to current time if writeback had to skip some pages. The real problem is in the functions clobbering i_dirtied_when but fixing that isn't trivial so revert is a safer choice for now. CC: stable@vger.kernel.org # >= 3.13 Signed-off-by: Jan Kara <jack@suse.cz>
|
#
c4a391b5 |
|
12-Nov-2013 |
Jan Kara <jack@suse.cz> |
writeback: do not sync data dirtied after sync start When there are processes heavily creating small files while sync(2) is running, it can easily happen that quite some new files are created between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen especially if there are several busy filesystems (remember that sync traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one fs before starting it on another fs). Because WB_SYNC_ALL pass is slow (e.g. causes a transaction commit and cache flush for each inode in ext3), resulting sync(2) times are rather large. The following script reproduces the problem: function run_writers { for (( i = 0; i < 10; i++ )); do mkdir $1/dir$i for (( j = 0; j < 40000; j++ )); do dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null done & done } for dir in "$@"; do run_writers $dir done sleep 40 time sync Fix the problem by disregarding inodes dirtied after sync(2) was called in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes a time stamp when sync has started which is used for setting up work for flusher threads. To give some numbers, when above script is run on two ext4 filesystems on simple SATA drive, the average sync time from 10 runs is 267.549 seconds with standard deviation 104.799426. With the patched kernel, the average sync time from 10 runs is 2.995 seconds with standard deviation 0.096. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
632b89e8 |
|
29-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: fix static and extern sparse warnings The kbuild test robot indicated that there were some new sparse warnings in fs/xfs/xfs_dquot_buf.c. Actually, there were a lot more that is wasn't warning about, so fix them all up. Reported-by: kbuild test robot Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
a4fbe6ab |
|
22-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: decouple inode and bmap btree header files Currently the xfs_inode.h header has a dependency on the definition of the BMAP btree records as the inode fork includes an array of xfs_bmbt_rec_host_t objects in it's definition. Move all the btree format definitions from xfs_btree.h, xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to xfs_format.h to continue the process of centralising the on-disk format definitions. With this done, the xfs inode definitions are no longer dependent on btree header files. The enables a massive culling of unnecessary includes, with close to 200 #include directives removed from the XFS kernel code base. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
239880ef |
|
22-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: decouple log and transaction headers xfs_trans.h has a dependency on xfs_log.h for a couple of structures. Most code that does transactions doesn't need to know anything about the log, but this dependency means that they have to include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header files and clean up the includes to be in dependency order. In doing this, remove the direct include of xfs_trans_reserve.h from xfs_trans.h so that we remove the dependency between xfs_trans.h and xfs_mount.h. Hence the xfs_trans.h include can be moved to the indicate the actual dependencies other header files have on it. Note that these are kernel only header files, so this does not translate to any userspace changes at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
57062787 |
|
14-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: unify directory/attribute format definitions The on-disk format definitions for the directory and attribute structures are spread across 3 header files right now, only one of which is dedicated to defining on-disk structures and their manipulation (xfs_dir2_format.h). Pull all the format definitions into a single header file - xfs_da_format.h - and switch all the code over to point at that. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
70a9883c |
|
22-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: create a shared header file for format-related information All of the buffer operations structures are needed to be exported for xfs_db, so move them all to a common location rather than spreading them all over the place. They are verifying the on-disk format, so while xfs_format.h might be a good place, it is not part of the on disk format. Hence we need to create a new header file that we centralise these related definitions. Start by moving the bffer operations structures, and then also move all the other definitions that have crept into xfs_log_format.h and xfs_format.h as there was no other shared header file to put them in. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
08e96e1a |
|
11-Oct-2013 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: remove newlines from strings passed to __xfs_printk __xfs_printk adds its own "\n". Having it in the original string leads to unintentional blank lines from these messages. Most format strings have no newline, but a few do, leading to i.e.: [ 7347.119911] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05 [ 7347.119911] [ 7347.119919] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05 [ 7347.119919] Fix them all. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
d948709b |
|
10-Sep-2013 |
Ben Myers <bpm@sgi.com> |
xfs: remove usage of is_bad_inode XFS never calls mark_inode_bad or iget_failed, so it will never see a bad inode. Remove all checks for is_bad_inode because they are unnecessary. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
9b17c623 |
|
27-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
fs: convert inode and dentry shrinking to be node aware Now that the shrinker is passing a node in the scan control structure, we can pass this to the the generic LRU list code to isolate reclaim to the lists on matching nodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
0a234c6d |
|
27-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
shrinker: convert superblock shrinkers to new API Convert superblock shrinker to use the new count/scan API, and propagate the API changes through to the filesystem callouts. The filesystem callouts already use a count/scan API, so it's just changing counters to longs to match the VM API. This requires the dentry and inode shrinker callouts to be converted to the count/scan API. This is mainly a mechanical change. [glommer@openvz.org: use mult_frac for fractional proportions, build fixes] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Glauber Costa <glommer@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
e546cb79 |
|
12-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: consolidate xfs_utils.c There are a few small helper functions in xfs_util, all related to xfs_inode modifications. Move them all to xfs_inode.c so all xfs_inode operations are consiolidated in the one place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
c24b5dfa |
|
12-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: kill xfs_vnodeops.[ch] Now we have xfs_inode.c for holding kernel-only XFS inode operations, move all the inode operations from xfs_vnodeops.c to this new file as it holds another set of kernel-only inode operations. The name of this file traces back to the days of Irix and it's vnodes which we don't have anymore. Essentially this move consolidates the inode locking functions and a bunch of XFS inode operations into the one file. Eventually the high level functions will be merged into the VFS interface functions in xfs_iops.c. This leaves only internal preallocation, EOF block manipulation and hole punching functions in vnodeops.c. Move these to xfs_bmap_util.c where we are already consolidating various in-kernel physical extent manipulation and querying functions. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
2b9ab5ab |
|
12-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: reshuffle dir2 definitions around for userspace Many of the definitions within xfs_dir2_priv.h are needed in userspace outside libxfs. Definitions within xfs_dir2_priv.h are wholly contained within libxfs, so we need to shuffle some of the definitions around to keep consistency across files shared between user and kernel space. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
6ca1c906 |
|
12-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: separate dquot on disk format definitions out of xfs_quota.h The on disk format definitions of the on-disk dquot, log formats and quota off log formats are all intertwined with other definitions for quotas. Separate them out into their own header file so they can easily be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
7a378c9a |
|
30-Jul-2013 |
Tejun Heo <tj@kernel.org> |
xfs: WQ_NON_REENTRANT is meaningless and going away dbf2576e37 ("workqueue: make all workqueues non-reentrant") made WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: xfs@oss.sgi.com Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
d892d586 |
|
19-Jul-2013 |
Chandra Seetharaman <sekharan@us.ibm.com> |
xfs: Start using pquotaino from the superblock. Start using pquotino and define a macro to check if the superblock has pquotino. Keep backward compatibilty by alowing mount of older superblock with no separate pquota inode. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
83e782e1 |
|
27-Jun-2013 |
Chandra Seetharaman <sekharan@us.ibm.com> |
xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD Remove all incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Instead, start using XFS_GQUOTA_.* XFS_PQUOTA_.* counterparts for GQUOTA and PQUOTA respectively. On-disk copy still uses XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD. Read and write of the superblock does the conversion from *OQUOTA* to *[PG]QUOTA*. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
dc037ad7 |
|
27-Jun-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: implement inode change count For CRC enabled filesystems, add support for the monotonic inode version change counter that is needed by protocols like NFSv4 for determining if the inode has changed in any way at all between two unrelated operations on the inode. This bumps the change count the first time an inode is dirtied in a transaction. Since all modifications to the inode are logged, this will catch all changes that are made to the inode, including timestamp updates that occur during data writes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
3ebe7d2d |
|
27-Jun-2013 |
Dave Chinner <david@fromorbit.com> |
xfs: Inode create log items Introduce the inode create log item type for logical inode create logging. Instead of logging the changes in buffers, pass the range to be initialised through the log by a new transaction type. This reduces the amount of log space required to record initialisation during allocation from about 128 bytes per inode to a small fixed amount per inode extent to be initialised. This requires a new log item type to track it through the log and the AIL. This is a relatively simple item - most callbacks are noops as this item has the same life cycle as the transaction. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
39a45d84 |
|
02-May-2013 |
Jie Liu <jeff.liu@oracle.com> |
xfs: Remove XFS_MOUNT_RETERR XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)" branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so we always be emitting a warning and returning an error. Hence, we can remove it and get rid of a couple of redundant check up against it at xfs_upate_alignment(). Thanks Dave Chinner for the suggestions of simplify the code in xfs_parseargs(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
f763fd44 |
|
04-Jun-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: disable noattr2/attr2 mount options for CRC enabled filesystems attr2 format is always enabled for v5 superblock filesystems, so the mount options to enable or disable it need to be cause mount errors. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit d3eaace84e40bf946129e516dcbd617173c1cf14)
|
#
d3eaace8 |
|
04-Jun-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: disable noattr2/attr2 mount options for CRC enabled filesystems attr2 format is always enabled for v5 superblock filesystems, so the mount options to enable or disable it need to be cause mount errors. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
7f78e035 |
|
02-Mar-2013 |
Eric W. Biederman <ebiederm@xmission.com> |
fs: Limit sys_mount to only request filesystem modules. Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
#
a17164e5 |
|
09-Jan-2013 |
Abhijit Pawar <abhi.c.pawar@gmail.com> |
fs/xfs remove obsolete simple_strto<foo> This patch replaces usages of obsolete simple_strtoul with kstrtoint in xfs_args and suffix_strtoul. Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
579b62fa |
|
06-Nov-2012 |
Brian Foster <bfoster@redhat.com> |
xfs: add background scanning to clear eofblocks inodes Create a new mount workqueue and delayed_work to enable background scanning and freeing of eofblocks inodes. The scanner kicks in once speculative preallocation occurs and stops requeueing itself when no eofblocks inodes exist. The scan interval is based on the new 'speculative_prealloc_lifetime' tunable (default to 5m). The background scanner performs unfiltered, best effort scans (which skips inodes under lock contention or with a dirty cache mapping). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
6d8b79cf |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: rename xfs_sync.[ch] to xfs_icache.[ch] xfs_sync.c now only contains inode reclaim functions and inode cache iteration functions. It is not related to sync operations anymore. Rename to xfs_icache.c to reflect it's contents and prepare for consolidation with the other inode cache file that exists (xfs_iget.c). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
c75921a7 |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_quiesce_attr() should quiesce the log like unmount xfs_quiesce_attr() is supposed to leave the log empty with an unmount record written. Right now it does not wait for the AIL to be emptied before writing the unmount record, not does it wait for metadata IO completion, either. Fix it to use the same method and code as xfs_log_unmount(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
c7eea6f7 |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: move xfs_quiesce_attr() into xfs_super.c Both callers of xfs_quiesce_attr() are in xfs_super.c, and there's nothing really sync-specific about this functionality so it doesn't really matter where it lives. Move it to benext to it's callers, so all the remount/sync_fs code is in the one place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
34061f5c |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_sync_fsdata is redundant Why do we need to write the superblock to disk once we've written all the data? We don't actually - the reasons for doing this are lost in the mists of time, and go back to the way Irix used to drive VFS flushing. On linux, this code is only called from two contexts: remount and .sync_fs. In the remount case, the call is followed by a metadata sync, which unpins and writes the superblock. In the sync_fs case, we only need to force the log to disk to ensure that the superblock is correctly on disk, so we don't actually need to write it. Hence the functionality is either redundant or superfluous and thus can be removed. Seeing as xfs_quiesce_data is essentially now just a log force, remove it as well and fold the code back into the two callers. Neither of them need the log covering check, either, as that is redundant for the remount case, and unnecessary for the .sync_fs case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
5889608d |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: syncd workqueue is no more With the syncd functions moved to the log and/or removed, the syncd workqueue is the only remaining bit left. It is used by the log covering/ail pushing work, as well as by the inode reclaim work. Given how cheap workqueues are these days, give the log and inode reclaim work their own work queues and kill the syncd work queue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
9aa05000 |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_sync_data is redundant. We don't do any data writeback from XFS any more - the VFS is completely responsible for that, including for freeze. We can replace the remaining caller with a VFS level function that achieves the same thing, but without conflicting with current writeback work. This means we can remove the flush_work and xfs_flush_inodes() - the VFS functionality completely replaces the internal flush queue for doing this writeback work in a separate context to avoid stack overruns. This does have one complication - it cannot be called with page locks held. Hence move the flushing of delalloc space when ENOSPC occurs back up into xfs_file_aio_buffered_write when we don't hold any locks that will stall writeback. Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to trigger delalloc conversion fast enough to prevent spurious ENOSPC whent here are hundreds of writers, thousands of small files and GBs of free RAM. Hence we need to use sync_sb_inodes() to block callers while we wait for writeback like the previous xfs_flush_inodes implementation did. That means we have to hold the s_umount lock here, but because this call can nest inside i_mutex (the parent directory in the create case, held by the VFS), we have to use down_read_trylock() to avoid potential deadlocks. In practice, this trylock will succeed on almost every attempt as unmount/remount type operations are exceedingly rare. Note: we always need to pass a count of zero to generic_file_buffered_write() as the previously written byte count. We only do this by accident before this patch by the virtue of ret always being zero when there are no errors. Make this explicit rather than needing to specifically zero ret in the ENOSPC retry case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
f661f1e0 |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: sync work is now only periodic log work The only thing the periodic sync work does now is flush the AIL and idle the log. These are really functions of the log code, so move the work to xfs_log.c and rename it appropriately. The only wart that this leaves behind is the xfssyncd_centisecs sysctl, otherwise the xfssyncd is dead. Clean up any comments that related to xfssyncd to reflect it's passing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
7f7bebef |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: don't run the sync work if the filesystem is read-only If the filesystem is mounted or remounted read-only, stop the sync worker that tries to flush or cover the log if the filesystem is dirty. It's read-only, so it isn't dirty. Restart it on a remount,rw as necessary. This avoids the need for RO checks in the work. Similarly, stop the sync work when the filesystem is frozen, and start it again when the filesysetm is thawed. This avoids the need for special freeze checks in the work. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
7e18530b |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: rationalise xfs_mount_wq users Instead of starting and stopping background work on the xfs_mount_wq all at the same time, separate them to where they really are needed to start and stop. The xfs_sync_worker, only needs to be started after all the mount processing has completed successfully, while it needs to be stopped before the log is unmounted. The xfs_reclaim_worker is started on demand, and can be stopped before the unmount process does it's own inode reclaim pass. The xfs_flush_inodes work is run on demand, and so we really only need to ensure that it has stopped running before we start processing an unmount, freeze or remount,ro. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
33c7a2bc |
|
08-Oct-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_syncd_stop must die xfs_syncd_start and xfs_syncd_stop tie a bunch of unrelated functionailty together that actually have different start and stop requirements. Kill these functions and open code the start/stop methods for each of the background functions. Subsequent patches will move the start/stop functions around to the correct places to avoid races and shutdown issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
8c0a8537 |
|
25-Sep-2012 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
fs: push rcu_barrier() from deactivate_locked_super() to filesystems There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
2ea03929 |
|
20-Sep-2012 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: Make inode32 a remountable option As inode64 is the default option now, and was also made remountable previously, inode32 can also be remounted on-the-fly when it is needed. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
4056c1d0 |
|
20-Sep-2012 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: add inode64->inode32 transition into xfs_set_inode32() To make inode32 a remountable option, xfs_set_inode32() should be able to make a transition from inode64 option, disabling inode allocation on higher AGs. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
4c083722 |
|
20-Sep-2012 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: Fix mp->m_maxagi update during inode64 remount With the changes made on xfs_set_inode64(), to make it behave as xfs_set_inode32() (now leaving to the caller the responsibility to update mp->m_maxagi), we use the return value of xfs_set_inode64() to update mp->m_maxagi during remount. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
2d2194f6 |
|
20-Sep-2012 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: reduce code duplication handling inode32/64 options Add xfs_set_inode32() to be used to enable inode32 allocation mode. this will reduce the amount of duplicated code needed to mount/remount a filesystem with inode32 option. This patch also changes xfs_set_inode64() to return the maximum AG number that inodes can be allocated instead of set mp->m_maxagi by itself, so that the behaviour is the same as xfs_set_inode32(). This simplifies code that calls these functions and needs to know the maximum AG that inodes can be allocated in. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
08bf5404 |
|
20-Sep-2012 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: make inode64 as the default allocation mode since 64-bit inodes can be accessed while using inode32, and these can also be used on 32-bit kernels, there is no reason to still keep inode32 as the default mount option. If the filesystem cannot handle 64bit inode numbers (i.e CONFIG_LBDAF is not enabled and BITS_PER_LONG == 32), XFS_MOUNT_SMALL_INUMS will still be set by default, so inode64 is not an unconditional default value. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
c3a58fec |
|
17-Aug-2012 |
Carlos Maiolino <cmaiolino@redhat.com> |
Make inode64 a remountable option Actually, there is no reason about why a user must umount and mount a XFS filesystem to enable 'inode64' option. So, this patch makes this a remountable option. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
0ba6e536 |
|
13-Sep-2012 |
Ben Myers <bpm@sgi.com> |
xfs: stop the sync worker before xfs_unmountfs Cancel work of the xfs_sync_worker before teardown of the log in xfs_unmountfs. This prevents occasional crashes on unmount like so: PID: 21602 TASK: ee9df060 CPU: 0 COMMAND: "kworker/0:3" #0 [c5377d28] crash_kexec at c0292c94 #1 [c5377d80] oops_end at c07090c2 #2 [c5377d98] no_context at c06f614e #3 [c5377dbc] __bad_area_nosemaphore at c06f6281 #4 [c5377df4] bad_area_nosemaphore at c06f629b #5 [c5377e00] do_page_fault at c070b0cb #6 [c5377e7c] error_code (via page_fault) at c070892c EAX: f300c6a8 EBX: f300c6a8 ECX: 000000c0 EDX: 000000c0 EBP: c5377ed0 DS: 007b ESI: 00000000 ES: 007b EDI: 00000001 GS: ffffad20 CS: 0060 EIP: c0481ad0 ERR: ffffffff EFLAGS: 00010246 #7 [c5377eb0] atomic64_read_cx8 at c0481ad0 #8 [c5377ebc] xlog_assign_tail_lsn_locked at f7cc7c6e [xfs] #9 [c5377ed4] xfs_trans_ail_delete_bulk at f7ccd520 [xfs] #10 [c5377f0c] xfs_buf_iodone at f7ccb602 [xfs] #11 [c5377f24] xfs_buf_do_callbacks at f7cca524 [xfs] #12 [c5377f30] xfs_buf_iodone_callbacks at f7cca5da [xfs] #13 [c5377f4c] xfs_buf_iodone_work at f7c718d0 [xfs] #14 [c5377f58] process_one_work at c024ee4c #15 [c5377f98] worker_thread at c024f43d #16 [c5377fbc] kthread at c025326b #17 [c5377fe8] kernel_thread_helper at c070e834 PID: 26653 TASK: e79143b0 CPU: 3 COMMAND: "umount" #0 [cde0fda0] __schedule at c0706595 #1 [cde0fe28] schedule at c0706b89 #2 [cde0fe30] schedule_timeout at c0705600 #3 [cde0fe94] __down_common at c0706098 #4 [cde0fec8] __down at c0706122 #5 [cde0fed0] down at c025936f #6 [cde0fee0] xfs_buf_lock at f7c7131d [xfs] #7 [cde0ff00] xfs_freesb at f7cc2236 [xfs] #8 [cde0ff10] xfs_fs_put_super at f7c80f21 [xfs] #9 [cde0ff1c] generic_shutdown_super at c0333d7a #10 [cde0ff38] kill_block_super at c0333e0f #11 [cde0ff48] deactivate_locked_super at c0334218 #12 [cde0ff58] deactivate_super at c033495d #13 [cde0ff68] mntput_no_expire at c034bc13 #14 [cde0ff7c] sys_umount at c034cc69 #15 [cde0ffa0] sys_oldumount at c034ccd4 #16 [cde0ffb0] system_call at c0707e66 commit 11159a05 added this to xfs_log_unmount and needs to be cleaned up at a later date. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com>
|
#
4026c9fd |
|
13-Sep-2012 |
Ben Myers <bpm@sgi.com> |
xfs: stop the sync worker before xfs_unmountfs Cancel work of the xfs_sync_worker before teardown of the log in xfs_unmountfs. This prevents occasional crashes on unmount like so: PID: 21602 TASK: ee9df060 CPU: 0 COMMAND: "kworker/0:3" #0 [c5377d28] crash_kexec at c0292c94 #1 [c5377d80] oops_end at c07090c2 #2 [c5377d98] no_context at c06f614e #3 [c5377dbc] __bad_area_nosemaphore at c06f6281 #4 [c5377df4] bad_area_nosemaphore at c06f629b #5 [c5377e00] do_page_fault at c070b0cb #6 [c5377e7c] error_code (via page_fault) at c070892c EAX: f300c6a8 EBX: f300c6a8 ECX: 000000c0 EDX: 000000c0 EBP: c5377ed0 DS: 007b ESI: 00000000 ES: 007b EDI: 00000001 GS: ffffad20 CS: 0060 EIP: c0481ad0 ERR: ffffffff EFLAGS: 00010246 #7 [c5377eb0] atomic64_read_cx8 at c0481ad0 #8 [c5377ebc] xlog_assign_tail_lsn_locked at f7cc7c6e [xfs] #9 [c5377ed4] xfs_trans_ail_delete_bulk at f7ccd520 [xfs] #10 [c5377f0c] xfs_buf_iodone at f7ccb602 [xfs] #11 [c5377f24] xfs_buf_do_callbacks at f7cca524 [xfs] #12 [c5377f30] xfs_buf_iodone_callbacks at f7cca5da [xfs] #13 [c5377f4c] xfs_buf_iodone_work at f7c718d0 [xfs] #14 [c5377f58] process_one_work at c024ee4c #15 [c5377f98] worker_thread at c024f43d #16 [c5377fbc] kthread at c025326b #17 [c5377fe8] kernel_thread_helper at c070e834 PID: 26653 TASK: e79143b0 CPU: 3 COMMAND: "umount" #0 [cde0fda0] __schedule at c0706595 #1 [cde0fe28] schedule at c0706b89 #2 [cde0fe30] schedule_timeout at c0705600 #3 [cde0fe94] __down_common at c0706098 #4 [cde0fec8] __down at c0706122 #5 [cde0fed0] down at c025936f #6 [cde0fee0] xfs_buf_lock at f7c7131d [xfs] #7 [cde0ff00] xfs_freesb at f7cc2236 [xfs] #8 [cde0ff10] xfs_fs_put_super at f7c80f21 [xfs] #9 [cde0ff1c] generic_shutdown_super at c0333d7a #10 [cde0ff38] kill_block_super at c0333e0f #11 [cde0ff48] deactivate_locked_super at c0334218 #12 [cde0ff58] deactivate_super at c033495d #13 [cde0ff68] mntput_no_expire at c034bc13 #14 [cde0ff7c] sys_umount at c034cc69 #15 [cde0ffa0] sys_oldumount at c034ccd4 #16 [cde0ffb0] system_call at c0707e66 commit 11159a05 added this to xfs_log_unmount and needs to be cleaned up at a later date. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com>
|
#
43829731 |
|
20-Aug-2012 |
Tejun Heo <tj@kernel.org> |
workqueue: deprecate flush[_delayed]_work_sync() flush[_delayed]_work_sync() are now spurious. Mark them deprecated and convert all users to flush[_delayed]_work(). If you're cc'd and wondering what's going on: Now all workqueues are non-reentrant and the regular flushes guarantee that the work item is not pending or running on any CPU on return, so there's no reason to use the sync flushes at all and they're going away. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mattia Dongili <malattia@linux.it> Cc: Kent Yoder <key@linux.vnet.ibm.com> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Bryan Wu <bryan.wu@canonical.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Avi Kivity <avi@redhat.com>
|
#
4f59af75 |
|
04-Jul-2012 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove iolock lock classes Content-Disposition: inline; filename=xfs-remove-iolock-classes Now that we never take the iolock during inode reclaim we don't need to play games with lock classes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Rich Johnston <rjohnston@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
69ff2826 |
|
06-Jun-2012 |
Christoph Hellwig <hch@infradead.org> |
xfs: implement ->update_time Use this new method to replace our hacky use of ->dirty_inode. An additional benefit is that we can now propagate errors up the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
1d9025e5 |
|
22-Jun-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove struct xfs_dabuf and infrastructure The struct xfs_dabuf now only tracks a single xfs_buf and all the information it holds can be gained directly from the xfs_buf. Hence we can remove the struct dabuf and pass the xfs_buf around everywhere. Kill the struct dabuf and the associated infrastructure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
77c1a08f |
|
22-Jun-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: struct xfs_buf_log_format isn't variable sized. The struct xfs_buf_log_format wants to think the dirty bitmap is variable sized. In fact, it is variable size on disk simply due to the way we map it from the in-memory structure, but we still just use a fixed size memory allocation for the in-memory structure. Hence it makes no sense to set the function up as a variable sized structure when we already know it's maximum size, and we always allocate it as such. Simplify the structure by making the dirty bitmap a fixed sized array and just using the size of the structure for the allocation size. This will make it much simpler to allocate and manipulate an array of format structures for discontiguous buffer support. The previous struct xfs_buf_log_item size according to /proc/slabinfo was 224 bytes. pahole doesn't give the same size because of the variable size definition. With this modification, pahole reports the same as /proc/slabinfo: /* size: 224, cachelines: 4, members: 6 */ Because the xfs_buf_log_item size is now determined by the maximum supported block size we introduce a dependency on xfs_alloc_btree.h. Avoid this dependency by moving the idefines for the maximum block sizes supported to xfs_types.h with all the other max/min type defines to avoid any new dependencies. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
ad1e95c5 |
|
22-Apr-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: clean up xfs_bit.h includes With the removal of xfs_rw.h and other changes over time, xfs_bit.h is being included in many files that don't actually need it. Clean up the includes as necessary. Also move the only-used-once xfs_ialloc_find_free() static inline function out of a header file that is widely included to reduce the number of needless dependencies on xfs_bit.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
4c2d542f |
|
23-Apr-2012 |
Dave Chinner <david@fromorbit.com> |
xfs: Do background CIL flushes via a workqueue Doing background CIL flushes adds significant latency to whatever async transaction that triggers it. To avoid blocking async transactions on things like waiting for log buffer IO to complete, move the CIL push off into a workqueue. By moving the push work into a workqueue, we remove all the latency that the commit adds from the foreground transaction commit path. This also means that single threaded workloads won't do the CIL push procssing, leaving them more CPU to do more async transactions. To do this, we need to keep track of the sequence number we have pushed work for. This avoids having many transaction commits attempting to schedule work for the same sequence, and ensures that we only ever have one push (background or forced) in progress at a time. It also means that we don't need to take the CIL lock in write mode to check for potential background push races, which reduces lock contention. To avoid potential issues with "smart" IO schedulers, don't use the workqueue for log force triggered flushes. Instead, do them directly so that the log IO is done directly by the process issuing the log force and so doesn't get stuck on IO elevator queue idling incorrectly delaying the log IO from the workqueue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
43ff2122 |
|
22-Apr-2012 |
Christoph Hellwig <hch@infradead.org> |
xfs: on-stack delayed write buffer lists Queue delwri buffers on a local on-stack list instead of a per-buftarg one, and write back the buffers per-process instead of by waking up xfsbufd. This is now easily doable given that we have very few places left that write delwri buffers: - log recovery: Only done at mount time, and already forcing out the buffers synchronously using xfs_flush_buftarg - quotacheck: Same story. - dquot reclaim: Writes out dirty dquots on the LRU under memory pressure. We might want to look into doing more of this via xfsaild, but it's already more optimal than the synchronous inode reclaim that writes each buffer synchronously. - xfsaild: This is the main beneficiary of the change. By keeping a local list of buffers to write we reduce latency of writing out buffers, and more importably we can remove all the delwri list promotions which were hitting the buffer cache hard under sustained metadata loads. The implementation is very straight forward - xfs_buf_delwri_queue now gets a new list_head pointer that it adds the delwri buffers to, and all callers need to eventually submit the list using xfs_buf_delwi_submit or xfs_buf_delwi_submit_nowait. Buffers that already are on a delwri list are skipped in xfs_buf_delwri_queue, assuming they already are on another delwri list. The biggest change to pass down the buffer list was done to the AIL pushing. Now that we operate on buffers the trylock, push and pushbuf log item methods are merged into a single push routine, which tries to lock the item, and if possible add the buffer that needs writeback to the buffer list. This leads to much simpler code than the previous split but requires the individual IOP_PUSH instances to unlock and reacquire the AIL around calls to blocking routines. Given that xfsailds now also handle writing out buffers, the conditions for log forcing and the sleep times needed some small changes. The most important one is that we consider an AIL busy as long we still have buffers to push, and the other one is that we do increment the pushed LSN for buffers that are under flushing at this moment, but still count them towards the stuck items for restart purposes. Without this we could hammer on stuck items without ever forcing the log and not make progress under heavy random delete workloads on fast flash storage devices. [ Dave Chinner: - rebase on previous patches. - improved comments for XBF_DELWRI_Q handling - fix XBF_ASYNC handling in queue submission (test 106 failure) - rename delwri submit function buffer list parameters for clarity - xfs_efd_item_push() should return XFS_ITEM_PINNED ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
7582df51 |
|
24-Apr-2012 |
Shaohua Li <shli@kernel.org> |
xfs: using GFP_NOFS for blkdev_issue_flush Issuing a block device flush request in transaction context using GFP_KERNEL directly can cause deadlocks due to memory reclaim recursion. Use GFP_NOFS to avoid recursion from reclaim context. Signed-off-by: Shaohua Li <shli@fusionio.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
dbd5768f |
|
03-May-2012 |
Jan Kara <jack@suse.cz> |
vfs: Rename end_writeback() to clear_inode() After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
|
#
8a00ebe4 |
|
12-Apr-2012 |
Dave Chinner <david@fromorbit.com> |
xfs: Ensure inode reclaim can run during quotacheck Because the mount process can run a quotacheck and consume lots of inodes, we need to be able to run periodic inode reclaim during the mount process. This will prevent running the system out of memory during quota checks. This essentially reverts 2bcf6e97, but that is safe to do now that the quota sync code that was causing problems during long quotacheck executions is now gone. The reclaim work is currently protected from running during the unmount process by a check against MS_ACTIVE. Unfortunately, this also means that the reclaim work cannot run during mount. The unmount process should stop the reclaim cleanly before freeing anything that the reclaim work depends on, so there is no need to have this guard in place. Also, the inode reclaim work is demand driven, so there is no need to start it immediately during mount. It will be started the moment an inode is queued for reclaim, so qutoacheck will trigger it just fine. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
da5bf95e |
|
11-Apr-2012 |
Jie Liu <jeff.liu@oracle.com> |
xfs: don't fill statvfs with project quota for a directory if it was not enabled. Check if the project quota is running or not before performing xfs_qm_statvfs(), just return if not. Otherwise the ASSERT XFS_IS_QUOTA_RUNNING in xfs_qm_dqget will be popped. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
5132ba8f |
|
21-Mar-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: don't cache inodes read through bulkstat When we read inodes via bulkstat, we generally only read them once and then throw them away - they never get used again. If we retain them in cache, then it simply causes the working set of inodes and other cached items to be reclaimed just so the inode cache can grow. Avoid this problem by marking inodes read by bulkstat not to be cached and check this flag in .drop_inode to determine whether the inode should be added to the VFS LRU or not. If the inode lookup hits an already cached inode, then don't set the flag. If the inode lookup hits an inode marked with no cache flag, remove the flag and allow it to be cached once the current reference goes away. Inodes marked as not cached will get cleaned up by the background inode reclaim or via memory pressure, so they will still generate some short term cache pressure. They will, however, be reclaimed much sooner and in preference to cache hot inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
c999a223 |
|
21-Mar-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce an allocation workqueue We currently have significant issues with the amount of stack that allocation in XFS uses, especially in the writeback path. We can easily consume 4k of stack between mapping the page, manipulating the bmap btree and allocating blocks from the free list. Not to mention btree block readahead and other functionality that issues IO in the allocation path. As a result, we can no longer fit allocation in the writeback path in the stack space provided on x86_64. To alleviate this problem, introduce an allocation workqueue and move all allocations to a seperate context. This can be easily added as an interposing layer into xfs_alloc_vextent(), which takes a single argument structure and does not return until the allocation is complete or has failed. To do this, add a work structure and a completion to the allocation args structure. This allows xfs_alloc_vextent to queue the args onto the workqueue and wait for it to be completed by the worker. This can be done completely transparently to the caller. The worker function needs to ensure that it sets and clears the PF_TRANS flag appropriately as it is being run in an active transaction context. Work can also be queued in a memory reclaim context, so a rescuer is needed for the workqueue. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
48fde701 |
|
08-Jan-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
switch open-coded instances of d_make_root() to new helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
8de52778 |
|
05-Feb-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: check i_nlink limits in vfs_{mkdir,rename_dir,link} New field of struct super_block - ->s_max_links. Maximal allowed value of ->i_nlink or 0; in the latter case all checks still need to be done in ->link/->mkdir/->rename instances. Note that this limit applies both to directoris and to non-directories. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
a05931ce |
|
13-Mar-2012 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove the global xfs_Gqm structure If we initialize the slab caches for the quota code when XFS is loaded there is no need for a global and reference counted quota manager structure. Drop all this overhead and also fix the error handling during quota initialization. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
8f639dde |
|
29-Feb-2012 |
Christoph Hellwig <hch@infradead.org> |
xfs: reimplement fdatasync support Add an in-memory only flag to say we logged timestamps only, and use it to check if fdatasync can optimize away the log force. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
8a9c9980 |
|
29-Feb-2012 |
Christoph Hellwig <hch@infradead.org> |
xfs: log timestamp updates Timestamps on regular files are the last metadata that XFS does not update transactionally. Now that we use the delaylog mode exclusively and made the log scode scale extremly well there is no need to bypass that code for timestamp updates. Logging all updates allows to drop a lot of code, and will allow for further performance improvements later on. Note that this patch drops optimized handling of fdatasync - it will be added back in a separate commit. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
aa6bf01d |
|
29-Feb-2012 |
Christoph Hellwig <hch@infradead.org> |
xfs: use per-filesystem I/O completion workqueues The new concurrency managed workqueues are cheap enough that we can create per-filesystem instead of global workqueues. This allows us to remove the trylock or defer scheme on the ilock, which is not helpful once we have outstanding log reservations until finishing a size update. Also allow the default concurrency on this workqueues so that I/O completions blocking on the ilock for one inode do not block process for another inode. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
4177af3a |
|
23-Jan-2012 |
Chandra Seetharaman <sekharan@us.ibm.com> |
Define new macro XFS_ALL_QUOTA_ACTIVE and simply some usage Define new macro XFS_ALL_QUOTA_ACTIVE and simply some usage of quota macros. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
f392e631 |
|
18-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: replace i_pin_wait with a bit waitqueue Replace i_pin_wait, which is only used during synchronous inode flushing with a bit waitqueue. This trades off a much smaller inode against slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit) bytes in the XFS inode. Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
474fce06 |
|
18-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: replace i_flock with a sleeping bitlock We almost never block on i_flock, the exception is synchronous inode flushing. Instead of bloating the inode with a 16/24-byte completion that we abuse as a semaphore just implement it as a bitlock that uses a bit waitqueue for the rare sleeping path. This primarily is a tradeoff between a much smaller inode and a faster non-blocking path vs faster wakeups, and we are much better off with the former. A small downside is that we will lose lockdep checking for i_flock, but given that it's always taken inside the ilock that should be acceptable. Note that for example the inode writeback locking is implemented in a very similar way. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
34c80b1d |
|
08-Dec-2011 |
Al Viro <viro@zeniv.linux.org.uk> |
vfs: switch ->show_options() to struct dentry * Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
be4f1ac8 |
|
20-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: log all dirty inodes in xfs_fs_sync_fs Since Linux 2.6.36 the writeback code has introduces various measures for live lock prevention during sync(). Unfortunately some of these are actively harmful for the XFS model, where the inode gets marked dirty for metadata from the data I/O handler. The older_than_this checks that are now more strictly enforced since writeback: avoid livelocking WB_SYNC_ALL writeback by only calling into __writeback_inodes_sb and thus only sampling the current cut off time once. But on a slow enough devices the previous asynchronous sync pass might not have fully completed yet, and thus XFS might mark metadata dirty only after that sampling of the cut off time for the blocking pass already happened. I have not myself reproduced this myself on a real system, but by introducing artificial delay into the XFS I/O completion workqueues it can be reproduced easily. Fix this by iterating over all XFS inodes in ->sync_fs and log all that are dirty. This might log inode that only got redirtied after the previous pass, but given how cheap delayed logging of inodes is it isn't a major concern for performance. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
0b8fd303 |
|
18-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: log the inode in ->write_inode calls for kupdate If the writeback code writes back an inode because it has expired we currently use the non-blockin ->write_inode path. This means any inode that is pinned is skipped. With delayed logging and a workload that has very little log traffic otherwise it is very likely that an inode that gets constantly written to is always pinned, and thus we keep refusing to write it. The VM writeback code at that point redirties it and doesn't try to write it again for another 30 seconds. This means under certain scenarious time based metadata writeback never happens. Fix this by calling into xfs_log_inode for kupdate in addition to data integrity syncs, and thus transfer the inode to the log ASAP. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
40d344ec |
|
05-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: mark the xfssyncd workqueue as non-reentrant On a system with lots of memory pressure that is stuck on synchronous inode reclaim the workqueue code will run one instance of the inode reclaim work item on every CPU. which is not what we want. Make sure to mark the xfssyncd workqueue as non-reentrant to make sure there only is one instace of each running globally. Also stop using special paramater for the workqueue; now that we guarantee each fs has only running one of each works at a time there is no need to artificially lower max_active and compensate for that by setting the WQ_CPU_INTENSIVE flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
34625c66 |
|
06-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove xfs_qm_sync Now that we can't have any dirty dquots around that aren't in the AIL we can get rid of the explicit dquot syncing from xfssyncd and xfs_fs_sync_fs and instead rely on AIL pushing to write out any quota updates. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
93b8a585 |
|
06-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove the deprecated nodelaylog option The delaylog mode has been the default for a long time, and the nodelaylog option has been scheduled for removal in Linux 3.3. Remove it and code only used by it now that we have opened the 3.3 window. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
a9add83e |
|
10-Oct-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove XFS_bflush Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
ddc3415a |
|
19-Sep-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: simplify xfs_trans_ijoin* again There is no reason to keep a reference to the inode even if we unlock it during transaction commit because we never drop a reference between the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref back into xfs_trans_ijoin - the third argument decides if an unlock is needed now. I'm actually starting to wonder if allowing inodes to be unlocked at transaction commit really is worth the effort. The only real benefit is that they can be unlocked earlier when commiting a synchronous transactions, but that could be solved by doing the log force manually after the unlock, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
4a06fd26 |
|
23-Aug-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove i_iocount We now have an i_dio_count filed and surrounding infrastructure to wait for direct I/O completion instead of i_icount, and we have never needed to iocount waits for buffered I/O given that we only set the page uptodate after finishing all required work. Thus remove i_iocount, and replace the actually needed waits with calls to inode_dio_wait. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
0030807c |
|
11-Oct-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: revert to using a kthread for AIL pushing Currently we have a few issues with the way the workqueue code is used to implement AIL pushing: - it accidentally uses the same workqueue as the syncer action, and thus can be prevented from running if there are enough sync actions active in the system. - it doesn't use the HIGHPRI flag to queue at the head of the queue of work items At this point I'm not confident enough in getting all the workqueue flags and tweaks right to provide a perfectly reliable execution context for AIL pushing, which is the most important piece in XFS to make forward progress when the log fills. Revert back to use a kthread per filesystem which fixes all the above issues at the cost of having a task struct and stack around for each mounted filesystem. In addition this also gives us much better ways to diagnose any issues involving hung AIL pushing and removes a small amount of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Tested-by: Stefan Priebe <s.priebe@profihost.ag> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
58d84c4e |
|
26-Aug-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: fix ->write_inode return values Currently we always redirty an inode that was attempted to be written out synchronously but has been cleaned by an AIL pushed internall, which is rather bogus. Fix that by doing the i_update_core check early on and return 0 for it. Also include async calls for it, as doing any work for those is just as pointless. While we're at it also fix the sign for the EIO return in case of a filesystem shutdown, and fix the completely non-sensical locking around xfs_log_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com> (cherry picked from commit 297db93bb74cf687510313eb235a7aec14d67e97) Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
242d6219 |
|
23-Aug-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: deprecate the nodelaylog mount option Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
c59d87c4 |
|
12-Aug-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove subdirectories Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the annoying subdirectories in the XFS source code. Besides the large amount of file rename the only changes are to the Makefile, a few files including headers with the subdirectory prefix, and the binary sysctl compat code that includes a header under fs/xfs/ from kernel/. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|