#
3fed24ff |
|
19-Feb-2024 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
xfs: Replace xfs_isilocked with xfs_assert_ilocked To use the new rwsem_assert_held()/rwsem_assert_held_write(), we can't use the existing ASSERT macro. Add a new xfs_assert_ilocked() and convert all the callers. Fix an apparent bug in xfs_isilocked(): If the caller specifies XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL, xfs_assert_ilocked() will check both the IOLOCK and the ILOCK are held for write. xfs_isilocked() only checked that the ILOCK was held for write. xfs_assert_ilocked() is always on, even if DEBUG or XFS_WARN aren't defined. It's a cheap check, so I don't think it's worth defining it away. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
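The difference between the two checks described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `demo_` names, the flag values and the boolean lock state are hypothetical stand-ins, not the kernel's definitions, and the "old" function only models the behaviour the commit message reports (the ILOCK test short-circuits the IOLOCK test).

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define DEMO_IOLOCK_EXCL 0x1u
#define DEMO_ILOCK_EXCL  0x2u

/* Hypothetical stand-in for the lock state of an inode. */
struct demo_inode {
	bool iolock_held_write;
	bool ilock_held_write;
};

/* Old-style predicate: returns after testing the first matching lock class. */
static bool demo_isilocked(const struct demo_inode *ip, unsigned int flags)
{
	if (flags & DEMO_ILOCK_EXCL)
		return ip->ilock_held_write;
	if (flags & DEMO_IOLOCK_EXCL)
		return ip->iolock_held_write;
	return false;
}

/* New-style check: every lock class named in flags must be held. */
static bool demo_all_locked(const struct demo_inode *ip, unsigned int flags)
{
	if ((flags & DEMO_ILOCK_EXCL) && !ip->ilock_held_write)
		return false;
	if ((flags & DEMO_IOLOCK_EXCL) && !ip->iolock_held_write)
		return false;
	return true;
}
```

With only the ILOCK held, the old-style predicate accepts `DEMO_IOLOCK_EXCL | DEMO_ILOCK_EXCL` while the new-style check rejects it.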
#
8f71bede |
|
15-Dec-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: repair inode fork block mapping data structures Use the reverse-mapping btree information to rebuild an inode block map. Update the btree bulk loading code as necessary to support inode rooted btrees and fix some bitrot problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
ef5a83b7 |
|
16-Oct-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use shifting and masking when converting rt extents, if possible Avoid the costs of integer division (32-bit and 64-bit) if the realtime extent size is a power of two. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
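A minimal sketch of the shift-and-mask idea, assuming a hypothetical `demo_rtgeom` structure that caches whether the realtime extent size is a power of two. None of these names come from the kernel, and the extent size is assumed to be nonzero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached geometry; field names are illustrative only. */
struct demo_rtgeom {
	uint32_t rextsize;	/* blocks per realtime extent, assumed nonzero */
	bool	 is_pow2;	/* true if rextsize is a power of two */
	uint8_t	 rextlog;	/* log2(rextsize), valid only if is_pow2 */
	uint32_t rextmask;	/* rextsize - 1, valid only if is_pow2 */
};

static struct demo_rtgeom demo_rtgeom_init(uint32_t rextsize)
{
	struct demo_rtgeom g = { .rextsize = rextsize };

	g.is_pow2 = rextsize != 0 && (rextsize & (rextsize - 1)) == 0;
	if (g.is_pow2) {
		while ((UINT32_C(1) << g.rextlog) < rextsize)
			g.rextlog++;
		g.rextmask = rextsize - 1;
	}
	return g;
}

/* Block count -> whole realtime extents. */
static uint64_t demo_blocks_to_rtx(const struct demo_rtgeom *g, uint64_t blocks)
{
	if (g->is_pow2)
		return blocks >> g->rextlog;	/* no division needed */
	return blocks / g->rextsize;
}

/* Block count -> offset within the realtime extent. */
static uint32_t demo_blocks_to_rtxoff(const struct demo_rtgeom *g, uint64_t blocks)
{
	if (g->is_pow2)
		return (uint32_t)(blocks & g->rextmask);
	return (uint32_t)(blocks % g->rextsize);
}
```

The power-of-two test and the log2 value are computed once, so each conversion costs a single branch plus a shift or mask in the common case.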
#
2c2b981b |
|
16-Oct-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: create a helper to convert extlen to rtextlen Create a helper to compute the realtime extent (xfs_rtxlen_t) from an extent length (xfs_extlen_t) value. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
d4d12c02 |
|
04-Jun-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: collect errors from inodegc for unlinked inode recovery Unlinked list recovery requires that errors from removing the inode from the unlinked list get fed back to the main recovery loop. Now that we offload the unlinking to the inodegc work, we don't get errors being fed back when we trip over a corruption that prevents the inode from being removed from the unlinked list. This means we never clear the corrupt unlinked list bucket, resulting in runtime operations eventually tripping over it and shutting down. Fix this by collecting inodegc worker errors and feeding them back to the flush caller. This is largely best effort - the only context that really cares is log recovery, and it only flushes a single inode at a time so we don't need complex synchronised handling. Essentially the inodegc workers will capture the first error that occurs and the next flush will gather them and clear them. The flush itself will only report the first gathered error. In the cases where callers can return errors, propagate the collected inodegc flush error up the error handling chain. In the case of inode unlinked list recovery, there are several superfluous calls to flush queued unlinked inodes - xlog_recover_iunlink_bucket() guarantees that it has flushed the inodegc and collected errors before it returns. Hence nothing in the calling path needs to run a flush, even when an error is returned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
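The "capture the first error, gather and clear on flush" scheme can be sketched as follows. This is a single-threaded illustration with hypothetical `demo_` names; real worker contexts would need proper synchronisation, which is omitted here.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_NR_WORKERS 4	/* hypothetical number of worker contexts */

/* Each worker slot remembers only the first error it saw (0 means none). */
struct demo_inodegc {
	int worker_error[DEMO_NR_WORKERS];
};

/* Called by a worker when an inactivation attempt finishes. */
static void demo_worker_record(struct demo_inodegc *gc, size_t worker, int error)
{
	if (error && !gc->worker_error[worker])
		gc->worker_error[worker] = error;	/* keep the first error only */
}

/* Flush: gather all recorded errors, clear them, report the first one found. */
static int demo_flush(struct demo_inodegc *gc)
{
	int first = 0;

	for (size_t i = 0; i < DEMO_NR_WORKERS; i++) {
		if (!first)
			first = gc->worker_error[i];
		gc->worker_error[i] = 0;
	}
	return first;
}
```

A caller that flushes after every inode, as log recovery is described as doing, sees at most one error per flush, which is why this best-effort reporting is sufficient in that context.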
#
cb042117 |
|
04-Jun-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: defered work could create precommits To fix an AGI-AGF-inode cluster buffer deadlock, we need to move inode cluster buffer operations to the ->iop_precommit() method. However, this means that deferred operations can require precommits to be run on the final transaction that the deferred ops pass back to xfs_trans_commit() context. This will be exposed by attribute handling, in that the last changes to the inode in the attr set state machine "disappear" because the precommit operation is not run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
55d5c3a3 |
|
10-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: don't assert fail on transaction cancel with deferred ops We can error out of an allocation transaction when updating BMBT blocks when things go wrong. This can be a btree corruption, an unexpected ENOSPC, etc. In these cases, we already have deferred ops queued for the first allocation that has been done, and we just want to cancel out the transaction and shut down the filesystem on error. In fact, we do just that for production systems - the assert that we can't have a transaction with defer ops attached unless we are already shut down is bogus and gets in the way of debugging whatever issue is actually causing the transaction to be cancelled. Remove the assert because it is causing spurious test failures to hang test machines. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
692b6cdd |
|
10-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: t_firstblock is tracking AGs not blocks The tp->t_firstblock field is now really tracking the highest AG we have locked, not the block number of the highest allocation we've made. Its purpose is to prevent AGF locking deadlocks, so rename it to "highest AG" and simplify the implementation to just track the agno rather than a fsbno. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
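A rough sketch of tracking the highest locked AG rather than a block number. It assumes the deadlock-avoidance rule is "never lock an AG numbered lower than one already locked in this transaction"; the `demo_` names and the sentinel value are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NULLAGNO UINT32_MAX	/* hypothetical "no AG locked yet" marker */

/* Hypothetical transaction that remembers only the highest AG it has locked. */
struct demo_trans {
	uint32_t highest_agno;
};

/* Assumed rule: AGs must be locked in ascending order within a transaction. */
static bool demo_may_lock_ag(const struct demo_trans *tp, uint32_t agno)
{
	return tp->highest_agno == DEMO_NULLAGNO || agno >= tp->highest_agno;
}

/* Record the lock if it is permitted; refuse it otherwise. */
static bool demo_lock_ag(struct demo_trans *tp, uint32_t agno)
{
	if (!demo_may_lock_ag(tp, agno))
		return false;
	tp->highest_agno = agno;
	return true;
}
```

Because only the AG number matters for the ordering rule, a plain AG number carries the same information as a full filesystem block number.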
#
fad743d7 |
|
13-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: add log item precommit operation For inodes that are dirty, we have an attached cluster buffer that we want to use to track the dirty inode through the AIL. Unfortunately, locking the cluster buffer and adding it to the transaction when the inode is first logged in a transaction leads to buffer lock ordering inversions. The specific problem is ordering against the AGI buffer. When modifying unlinked lists, the buffer lock order is AGI -> inode cluster buffer as the AGI buffer lock serialises all access to the unlinked lists. Unfortunately, functionality like xfs_droplink() logs the inode before calling xfs_iunlink(), as do various directory manipulation functions. The inode can be logged way down in the stack as far as the bmapi routines and hence, without a major rewrite of lots of APIs there's no way we can avoid the inode being logged by something until after the AGI has been logged. As we are going to be using ordered buffers for inode AIL tracking, there isn't a need to actually lock that buffer against modification as all the modifications are captured by logging the inode item itself. Hence we don't actually need to join the cluster buffer into the transaction until just before it is committed. This means we do not perturb any of the existing buffer lock orders in transactions, and the inode cluster buffer is always locked last in a transaction that doesn't otherwise touch inode cluster buffers. We do this by introducing a precommit log item method. This commit just introduces the mechanism; the inode item implementation is in followup commits. The precommit items need to be sorted into consistent order as we may be locking multiple items here. Hence if we have two dirty inodes in cluster buffers A and B, and some other transaction has two separate dirty inodes in the same cluster buffers, locking them in different orders opens us up to ABBA deadlocks. Hence we sort the items on the transaction based on the presence of a sort log item method. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
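The reason for sorting precommit items can be sketched in C: if every transaction orders its items by the same key before locking, two transactions touching the same pair of cluster buffers always lock them in the same order. The sketch below uses `qsort()` on an array with a hypothetical `sort_key` field; this is an assumption for illustration only, not how the kernel stores or sorts log items.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical log item; sort_key stands in for whatever identifies the buffer. */
struct demo_item {
	uint64_t sort_key;
};

/* Three-way comparison that avoids overflow from subtracting unsigned keys. */
static int demo_item_cmp(const void *a, const void *b)
{
	const struct demo_item *ia = a;
	const struct demo_item *ib = b;

	return (ia->sort_key > ib->sort_key) - (ia->sort_key < ib->sort_key);
}

/* Put the items into one global order before any of them is locked. */
static void demo_sort_for_precommit(struct demo_item *items, size_t nr)
{
	qsort(items, nr, sizeof(*items), demo_item_cmp);
}
```

Two transactions that dirtied inodes in buffers A and B in opposite orders both end up locking A first and B second, so the ABBA pattern cannot occur between them.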
#
16924853 |
|
07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert log vector chain to use list heads Because the next change is going to require sorting log vectors, and that requires arbitrary rearrangement of the list which cannot be done easily with a single linked list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
918247ce |
|
25-Apr-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: report "max_resp" used for min log size computation Move the tracepoint that computes the size of the transaction used to compute the minimum log size into xfs_log_get_max_trans_res so that we only have to compute this stuff once. Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db can call it to report the results of the userspace computation of the same value to diagnose mkfs/kernel misinteractions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
c7610dce |
|
20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: log tickets don't need log client id We currently set the log ticket client ID when we reserve a transaction. This client ID is only ever written to the log by a CIL checkpoint or unmount records, and so anything using a high level transaction allocated through xfs_trans_alloc() does not need a log ticket client ID to be set. For the CIL checkpoint, the client ID written to the journal is always XFS_TRANSACTION, and for the unmount record it is always XFS_LOG, and nothing else writes to the log. All of these operations tell xlog_write() exactly what they need to write to the log (the optype) and build their own opheaders for start, commit and unmount records. Hence we no longer need to set the client id in either the log ticket or the xfs_trans. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
2229276c |
|
11-Apr-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use a separate frextents counter for rt extent reservations As mentioned in the previous commit, the kernel misuses sb_frextents in the incore mount to reflect both incore reservations made by running transactions as well as the actual count of free rt extents on disk. This results in the superblock being written to the log with an underestimate of the number of rt extents that are marked free in the rtbitmap. Teaching XFS to recompute frextents after log recovery avoids operational problems in the current mount, but it doesn't solve the problem of us writing undercounted frextents which are then recovered by an older kernel that doesn't have that fix. Create an incore percpu counter to mirror the ondisk frextents. This new counter will track transaction reservations and the only time we will touch the incore super counter (i.e. the one that gets logged) is when those transactions commit updates to the rt bitmap. This is in contrast to the lazysbcount counters (e.g. fdblocks), where we know that log recovery will always fix any incorrect counter that we log. As a bonus, we only take m_sb_lock at transaction commit time. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
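The split between a reservation counter and the logged counter can be sketched as below. This is a simplified, single-threaded model with hypothetical `demo_` names and plain integers in place of a percpu counter; it only illustrates which counter is touched when.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mount state with two views of the free rt extent count. */
struct demo_mount {
	int64_t logged_frextents;	/* value that is written to the log */
	int64_t incore_frextents;	/* free extents minus running reservations */
};

/* Reserving touches only the incore counter. */
static bool demo_reserve_rtx(struct demo_mount *mp, int64_t nr)
{
	if (mp->incore_frextents < nr)
		return false;
	mp->incore_frextents -= nr;
	return true;
}

/* Commit: only extents actually allocated change the logged counter. */
static void demo_commit_rtx(struct demo_mount *mp, int64_t reserved, int64_t used)
{
	mp->logged_frextents -= used;
	mp->incore_frextents += reserved - used;	/* return the unused part */
}

/* Cancel: the whole reservation goes back; the logged counter is untouched. */
static void demo_cancel_rtx(struct demo_mount *mp, int64_t reserved)
{
	mp->incore_frextents += reserved;
}
```

While a reservation is outstanding, the logged counter still matches what the bitmap says is free, so a superblock logged at that moment is not undercounted.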
#
3c4cb76b |
|
29-Mar-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_trans_commit() path must check for log shutdown

If a shutdown races with xfs_trans_commit() and we have shut down the filesystem but not the log, we will still cancel the transaction. This can result in aborting dirty log items instead of committing and pinning them whilst the log is still running. Hence we can end up with dirty, unlogged metadata that isn't in the AIL in memory that can be flushed to disk via writeback clustering.

This was discovered from a g/388 trace where an inode log item was having IO completed on it and it wasn't in the AIL, hence tripping asserts in xfs_ail_check(). Inode cluster writeback started long after the filesystem shutdown started, and long after the transaction containing the dirty inode was aborted and the log item marked XFS_LI_ABORTED. The inode was seen as dirty and unpinned, so it was flushed. IO completion tried to remove the inode from the AIL, at which point stuff went bad:

    XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500). Shutting down filesystem.
    XFS: Assertion failed: in_ail, file: fs/xfs/xfs_trans_ail.c, line: 67
    XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
    Workqueue: xfs-buf/pmem1 xfs_buf_ioend_work
    RIP: 0010:assfail+0x27/0x2d
    Call Trace:
     <TASK>
     xfs_ail_check+0xa8/0x180
     xfs_ail_delete_one+0x3b/0xf0
     xfs_buf_inode_iodone+0x329/0x3f0
     xfs_buf_ioend+0x1f8/0x530
     xfs_buf_ioend_work+0x15/0x20
     process_one_work+0x1ac/0x390
     worker_thread+0x56/0x3c0
     kthread+0xf6/0x120
     ret_from_fork+0x1f/0x30
     </TASK>

xfs_trans_commit() needs to check log state for shutdown, not mount state. It cannot abort dirty log items while the log is still running as dirty items must remain pinned in memory until they are either committed to the journal or the log has shut down and they can be safely tossed away. Hence if the log has not shut down, the xfs_trans_commit() path must allow completed transactions to commit to the CIL and pin the dirty items even if a mount shutdown has started.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
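The rule stated above, that the commit path keys its decision on log shutdown rather than mount shutdown, can be sketched as a pair of decision functions. Both are hypothetical models with `demo_` names; they only contrast which state is consulted and do not reproduce the kernel code.

```c
#include <stdbool.h>

enum demo_commit_action {
	DEMO_COMMIT_TO_CIL,	/* commit and pin the dirty items */
	DEMO_ABORT_ITEMS	/* cancel the transaction, abort dirty items */
};

/* Model of the problematic behaviour: a mount shutdown alone aborts items. */
static enum demo_commit_action demo_old_decision(bool mount_shutdown, bool log_shutdown)
{
	(void)log_shutdown;	/* not consulted in this model */
	return mount_shutdown ? DEMO_ABORT_ITEMS : DEMO_COMMIT_TO_CIL;
}

/* Model of the fixed behaviour: only a log shutdown permits aborting items. */
static enum demo_commit_action demo_new_decision(bool mount_shutdown, bool log_shutdown)
{
	(void)mount_shutdown;	/* deliberately not consulted */
	return log_shutdown ? DEMO_ABORT_ITEMS : DEMO_COMMIT_TO_CIL;
}
```

The two models differ only in the race window where the mount is already shut down but the log is still running; that is the window in which aborted, unpinned dirty items could be written back.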
#
8eda8721 |
|
17-Mar-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: AIL should be log centric The AIL operates purely on log items, so it is a log centric subsystem. Divorce it from the xfs_mount and instead have it pass around xlog pointers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
d86142dd |
|
17-Mar-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: log items should have a xlog pointer, not a mount Log items belong to the log, not the xfs_mount. Convert the mount pointer in the log item to a xlog pointer in preparation for upcoming log centric changes to the log items. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
871b9316 |
|
25-Feb-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: reserve quota for dir expansion when linking/unlinking files XFS does not reserve quota for directory expansion when linking or unlinking children from a directory. This means that we don't reject the expansion with EDQUOT when we're at or near a hard limit, which means that unprivileged userspace can use link()/unlink() to exceed quota. The fix for this is nuanced -- link operations don't always expand the directory, and we allow a link to proceed with no space reservation if we don't need to add a block to the directory to handle the addition. Unlink operations generally do not expand the directory (you'd have to free a block and then cause a btree split) and we can defer the directory block freeing if there is no space reservation. Moreover, there is a further bug in that we do not trigger the blockgc workers to try to clear space when we're out of quota. To fix both cases, create a new xfs_trans_alloc_dir function that allocates the transaction, locks and joins the inodes, and reserves quota for the directory. If there isn't sufficient space or quota, we'll switch the caller to reservationless mode. This should prevent quota usage overruns with the least restriction in functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
47a6df7c |
|
15-Dec-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: shut down filesystem if we xfs_trans_cancel with deferred work items While debugging some very strange rmap corruption reports in connection with the online directory repair code, I root-caused the error to the following incorrect sequence: <start repair transaction> <expand directory, causing a deferred rmap to be queued> <roll transaction> <cancel transaction> Obviously, we should have committed the transaction instead of cancelling it. Thinking more broadly, however, xfs_trans_cancel should have warned us that we were throwing away work items that we already committed to performing. This is not correct, and we need to shut down the filesystem. Change xfs_trans_cancel to complain in the loudest manner if we're cancelling any transaction with deferred work items attached. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
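A minimal sketch of the policy described above: cancelling a transaction that still carries deferred work is treated like cancelling a dirty transaction, which forces a shutdown. The structures and names are hypothetical, and the logging or warning the real code might do is left out.

```c
#include <stdbool.h>

/* Hypothetical filesystem and transaction state. */
struct demo_fs {
	bool shutdown;
};

struct demo_trans {
	struct demo_fs *fs;
	bool		dirty;		/* transaction has modified metadata */
	unsigned int	nr_deferred;	/* queued deferred work items */
};

static void demo_trans_cancel(struct demo_trans *tp)
{
	/* Pending deferred work means promised updates are being thrown away. */
	if (tp->nr_deferred > 0) {
		tp->dirty = true;
		tp->nr_deferred = 0;	/* drop the queued items */
	}

	/* Cancelling a dirty transaction leaves metadata inconsistent: shut down. */
	if (tp->dirty && !tp->fs->shutdown)
		tp->fs->shutdown = true;
}
```

A clean transaction with no deferred work can still be cancelled without any side effect on the filesystem state.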
#
182696fb |
|
12-Oct-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: rename _zone variables to _cache Now that we've gotten rid of the kmem_zone_t typedef, rename the variables to _cache since that's what they are. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
|
#
e7720afa |
|
27-Sep-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: remove kmem_zone typedef Remove these typedefs by referencing kmem_cache directly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
|
#
ed67ebfd |
|
11-Oct-2021 |
Christoph Hellwig <hch@lst.de> |
xfs: remove the xfs_dsb_t typedef Remove the few leftover instances of the xfs_dsb_t typedef. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
75c8c50f |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown Remove the shouty macro and instead use the inline function that matches other state/feature check wrapper naming. This conversion was done with sed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
38c26bfd |
|
18-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: replace xfs_sb_version checks with feature flag checks

Convert the xfs_sb_version_hasfoo() checks to checks against mp->m_features. Checks of the superblock itself during disk operations (e.g. in the read/write verifiers and the to/from disk formatters) are not converted - they operate purely on the superblock state. Everything else should use the mount features.

Large parts of this conversion were done with sed with commands like this:

    for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
        sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
    done

With manual cleanups for things like "xfs_has_extflgbit" and other little inconsistencies in naming.

The result is a lot less typing to check features and an XFS binary size reduced by a bit over 3kB:

    $ size -t fs/xfs/built-in.a
              text    data  bss      dec     hex  filename
    before 1130866  311352  484  1442702  16038e  (TOTALS)
    after  1127727  311352  484  1439563  15f74b  (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
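The shape of the conversion can be sketched as follows: a predicate that decodes superblock fields on every call is replaced by a test of a feature bit computed once at mount time. The structures, constants and field names below are hypothetical stand-ins chosen for illustration, not the real XFS definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical on-disk superblock fields and constants. */
#define DEMO_SB_VERSION_5	  5
#define DEMO_SB_RO_COMPAT_REFLINK 0x4u

/* Hypothetical incore feature bit. */
#define DEMO_FEAT_REFLINK	  (UINT64_C(1) << 0)

struct demo_sb {
	uint16_t versionnum;
	uint32_t ro_compat;
};

struct demo_mount {
	struct demo_sb m_sb;
	uint64_t       m_features;	/* computed once at mount time */
};

/* Old style: decode superblock state at every call site. */
static bool demo_sb_version_hasreflink(const struct demo_sb *sbp)
{
	return sbp->versionnum == DEMO_SB_VERSION_5 &&
	       (sbp->ro_compat & DEMO_SB_RO_COMPAT_REFLINK) != 0;
}

/* Mount time: translate superblock state into feature bits once. */
static void demo_init_features(struct demo_mount *mp)
{
	if (demo_sb_version_hasreflink(&mp->m_sb))
		mp->m_features |= DEMO_FEAT_REFLINK;
}

/* New style: call sites test a single bit and take the mount directly. */
static bool demo_has_reflink(const struct demo_mount *mp)
{
	return (mp->m_features & DEMO_FEAT_REFLINK) != 0;
}
```

A call site changes from `demo_sb_version_hasreflink(&mp->m_sb)` to `demo_has_reflink(mp)`, which mirrors the pattern the sed expression rewrites.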
#
0020a190 |
|
10-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: AIL needs asynchronous CIL forcing

The AIL pushing is stalling on log forces when it comes across pinned items. This is happening on removal workloads where the AIL is dominated by stale items that are removed from AIL when the checkpoint that marks the items stale is committed to the journal. This results in relatively few items in the AIL, but those that are there are often pinned, as the directories that items are being removed from are still being logged.

As a result, many push cycles through the CIL will first issue a blocking log force to unpin the items. This can take some time to complete, with tracing regularly showing push delays of half a second and sometimes up into the range of several seconds. Sequences like this aren't uncommon:

    ....
    399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 270ms delay>
    400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
    400.099623: xfsaild: first lsn 0x11002f3600
    400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
    <wanted 50ms, got 500ms delay>
    400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
    400.589349: xfsaild: first lsn 0x1100305000
    400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
    <wanted 50ms, got 460ms delay>
    400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
    400.950343: xfsaild: first lsn 0x1100317c00
    400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 200ms delay>
    401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
    401.142334: xfsaild: first lsn 0x110032e600
    401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
    <wanted 10ms, got 10ms delay>
    401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
    401.154328: xfsaild: first lsn 0x1100353000
    401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
    <wanted 20ms, got 300ms delay>
    401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
    401.451526: xfsaild: first lsn 0x1100353000
    401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
    <wanted 50ms, got 500ms delay>
    401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
    ....

In each of these cases, every AIL pass saw 101 log items stuck on the AIL (pinned) with very few other items being found. Each pass, a log force was issued, and delay between last/first is the sleep time + the sync log force time.

Some of these 101 items pinned the tail of the log. The tail of the log does slowly creep forward (first lsn), but the problem is that the log is actually out of reservation space because it's been running so many transactions that stale items that never reach the AIL but consume log space. Hence we have a largely empty AIL, with long term pins on items that pin the tail of the log that don't get pushed frequently enough to keep log space available.

The problem is the hundreds of milliseconds that we block in the log force pushing the CIL out to disk. The AIL should not be stalled like this - it needs to run and flush items that are at the tail of the log with minimal latency. What we really need to do is trigger a log flush, but then not wait for it at all - we've already done our waiting for stuff to complete when we backed off prior to the log force being issued.

Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we still do a blocking flush of the CIL and that is what is causing the issue. Hence we need a new interface for the CIL to trigger an immediate background push of the CIL to get it moving faster but not to wait on that to occur. While the CIL is pushing, the AIL can also be pushing.

We already have an internal interface to do this - xlog_cil_push_now() - but we need a wrapper for it to be used externally. xlog_cil_force_seq() can easily be extended to do what we need as it already implements the synchronous CIL push via xlog_cil_push_now(). Add the necessary flags and "push current sequence" semantics to xlog_cil_force_seq() and convert the AIL pushing to use it.

One of the complexities here is that the CIL push does not guarantee that the commit record for the CIL checkpoint is written to disk. The current log force ensures this by submitting the current ACTIVE iclog that the commit record was written to. We need the CIL to actually write this commit record to disk for an async push to ensure that the checkpoint actually makes it to disk and unpins the pinned items in the checkpoint on completion. Hence we need to pass down to the CIL push that we are doing an async flush so that it can switch out the commit_iclog if necessary to get written to disk when the commit iclog is finally released.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
2039a272 |
|
10-Aug-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown() Make it less shouty and a static inline before adding more calls through the log code. Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount) to use xlog_is_shutdown(log) as well. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
e8d04c2a |
|
06-Aug-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use background worker pool when transactions can't get free space In xfs_trans_alloc, if the block reservation call returns ENOSPC, we call xfs_blockgc_free_space with a NULL icwalk structure to try to free space. Each frontend thread that encounters this situation starts its own walk of the inode cache to see if it can find anything, which is wasteful since we don't have any additional selection criteria. For this one common case, create a function that reschedules all pending background work immediately and flushes the workqueue so that the scan can run in parallel. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
5f9b4b0d |
|
18-Jun-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: xfs_log_force_lsn isn't passed a LSN In doing an investigation into AIL push stalls, I was looking at the log force code to see if an async CIL push could be done instead. This led me to xfs_log_force_lsn() and looking at how it works. xfs_log_force_lsn() is only called from inode synchronisation contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn value as the LSN to sync the log to. This gets passed to xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the journal, and then used by xfs_log_force_lsn() to flush the iclogs to the journal. The problem is that ip->i_itemp->ili_last_lsn does not store a log sequence number. What it stores is passed to it from the ->iop_committing method, which is called by xfs_log_commit_cil(). The value this passes to the iop_committing method is the CIL context sequence number that the item was committed to. As it turns out, xlog_cil_force_lsn() converts the sequence to an actual commit LSN for the related context and returns that to xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn" variable that contained a sequence with an actual LSN and then uses that to sync the iclogs. This caused me some confusion for a while, even though I originally wrote all this code a decade ago. ->iop_committing is only used by a couple of log item types, and only inode items use the sequence number it is passed. Let's clean up the API, CIL structures and inode log item to call it a sequence number, and make it clear that the high level code is using CIL sequence numbers and not on-disk LSNs for integrity synchronisation purposes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
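The distinction between a CIL sequence number and a commit LSN can be sketched with two separate types and an explicit lookup. All names, the array-based context table and the "no LSN" sentinel are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical distinct types so a sequence is not mistaken for an LSN. */
typedef uint64_t demo_csn_t;	/* CIL context sequence number */
typedef int64_t	 demo_lsn_t;	/* on-disk log sequence number */

#define DEMO_NULL_LSN ((demo_lsn_t)0)	/* hypothetical "not found" value */

/* Hypothetical record of a committed CIL context. */
struct demo_cil_ctx {
	demo_csn_t sequence;
	demo_lsn_t commit_lsn;
};

/* Translate the sequence an item was committed in to that context's commit LSN. */
static demo_lsn_t demo_seq_to_commit_lsn(const struct demo_cil_ctx *ctxs,
					 size_t nr, demo_csn_t seq)
{
	for (size_t i = 0; i < nr; i++) {
		if (ctxs[i].sequence == seq)
			return ctxs[i].commit_lsn;
	}
	return DEMO_NULL_LSN;
}
```

Keeping the translation in one named function makes it visible where a sequence number turns into an LSN, which is the confusion the cleanup is meant to remove.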
#
6543990a |
|
26-Apr-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: update superblock counters correctly for !lazysbcount

Keep the mount superblock counters up to date for !lazysbcount filesystems so that when we log the superblock they do not need updating in any way because they are already correct.

The problem was found from what Zorro reported:

1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev
2. mount $dev $mnt
3. fsstress -d $mnt -p 100 -n 1000 (maybe need more or less io load)
4. umount $mnt
5. xfs_repair -n $dev

and I've seen no problem with this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
1aec7c3d |
|
23-Apr-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: remove obsolete AGF counter debugging In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to remove blocks from an overfilled AGFL because there were complaints about transaction overruns that stemmed from trying to free multiple blocks in a single transaction. Unfortunately, that commit missed a subtlety in the debug-mode transaction accounting when a realtime volume is attached. If a realtime file undergoes a data fork mapping change such that realtime extents are allocated (or freed) in the same transaction that a data device block is also allocated (or freed), we can trip a debugging assertion. This can happen (for example) if a realtime extent is allocated and it is necessary to reshape the bmbt to hold the new mapping. When we go to allocate a bmbt block from an AG, the first thing the data device block allocator does is ensure that the freelist is the proper length. If the freelist is too long, it will trim the freelist to the proper length. In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to record the decrement in the AG free list count. Prior to f8f28 we would put the free block back in the free space btrees in the same transaction, which calls xfs_trans_agblocks_delta() to record the increment in the AG free block count. Since AGFL blocks are included in the global free block count (fdblocks), there is no corresponding fdblocks update, so the AGFL free satisfies the following condition in xfs_trans_apply_sb_deltas: /* * Check that superblock mods match the mods made to AGF counters. */ ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) == (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta + tp->t_ag_btree_delta)); The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is the number blocks that were allocated. After commit f8f28 we defer the block freeing to the next chained transaction, which means that the calls to xfs_trans_agflist_delta and xfs_trans_agblocks_delta occur in separate transactions. 
The (first) transaction that shortens the free list trips on the comparison, which has now become: (X + 0) == ((X) + -1 + 0) because we haven't freed the AGFL block yet; we've only logged an intention to free it. When the second transaction (the deferred free) commits, it will evaluate the expression as: (0 + 0) == (1 + 0 + 0) and trip over that in turn. At this point, the astute reader may note that the two commits tagged by this patch have been in the kernel for a long time but haven't generated any bug reports. How is it that the author became aware of this bug? This originally surfaced as an intermittent failure when I was testing realtime rmap, but a different bug report by Zorro Lang reveals the same assertion occuring on !lazysbcount filesystems. The common factor to both reports (and why this problem wasn't previously reported) becomes apparent if we consider when xfs_trans_apply_sb_deltas is called by __xfs_trans_commit(): if (tp->t_flags & XFS_TRANS_SB_DIRTY) xfs_trans_apply_sb_deltas(tp); With a modern lazysbcount filesystem, transactions update only the percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence xfs_trans_apply_sb_deltas is rarely called. However, updates to the count of free realtime extents are not part of lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or removing data fork mappings to realtime files; similarly, XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems. Dave mentioned in response to an earlier version of this patch: "IIUC, what you are saying is that this debug code is simply not exercised in normal testing and hasn't been for the past decade? And it still won't be exercised on anything other than realtime device testing? "...it was debugging code from 1994 that was largely turned into dead code when lazysbcounters were introduced in 2007. Hence I'm not sure it holds any value anymore." 
This debugging code isn't especially helpful - you can modify the flcount on one AG and the freeblks of another AG, and it won't trigger. Add the fact that nobody noticed for a decade, and let's just get rid of it (and start testing realtime :P). This bug was found by running generic/051 on either a V4 filesystem lacking lazysbcount; or a V5 filesystem with a realtime volume. Cc: bfoster@redhat.com, zlang@redhat.com Fixes: f8f2835a9cf3 ("xfs: defer agfl block frees when dfops is available") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
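The three comparisons quoted in this message can be replayed with a small standalone model. This is only a sketch: the struct and function names are hypothetical stand-ins for the t_*_delta fields named above, not kernel code, and X is treated as a plain signed delta.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-transaction deltas named in the message. */
struct model_deltas {
	int64_t fdblocks;      /* models t_fdblocks_delta */
	int64_t res_fdblocks;  /* models t_res_fdblocks_delta */
	int64_t ag_freeblks;   /* models t_ag_freeblks_delta */
	int64_t ag_flist;      /* models t_ag_flist_delta */
	int64_t ag_btree;      /* models t_ag_btree_delta */
};

/* The condition the removed ASSERT checked. */
bool model_sb_matches_agf(const struct model_deltas *d)
{
	return (d->fdblocks + d->res_fdblocks) ==
	       (d->ag_freeblks + d->ag_flist + d->ag_btree);
}

/* Old behavior: AGFL trim and btree free land in one transaction. */
struct model_deltas model_same_tx(int64_t x)
{
	struct model_deltas d = { .fdblocks = x, .ag_freeblks = x + 1, .ag_flist = -1 };
	return d;
}

/* New behavior, first transaction: the trim is recorded, the free is only deferred. */
struct model_deltas model_trim_tx(int64_t x)
{
	struct model_deltas d = { .fdblocks = x, .ag_freeblks = x, .ag_flist = -1 };
	return d;
}

/* New behavior, second transaction: the deferred free runs on its own. */
struct model_deltas model_deferred_free_tx(void)
{
	struct model_deltas d = { .ag_freeblks = 1 };
	return d;
}

/* Replays the three cases from the message. */
void model_replay(int64_t x)
{
	struct model_deltas a = model_same_tx(x);
	struct model_deltas b = model_trim_tx(x);
	struct model_deltas c = model_deferred_free_tx();

	assert(model_sb_matches_agf(&a));   /* (X + 0) == ((X+1) + -1 + 0) */
	assert(!model_sb_matches_agf(&b));  /* (X + 0) != (X + -1 + 0) */
	assert(!model_sb_matches_agf(&c));  /* (0 + 0) != (1 + 0 + 0) */
}
```

Calling model_replay() with any value of X passes, which matches the message's point that the check only ever balanced when both halves of the AGFL free happened in the same transaction.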
|
#
6e73a545 |
|
29-Mar-2021 |
Christoph Hellwig <hch@lst.de> |
xfs: move the di_nblocks field to struct xfs_inode In preparation for removing the historic icinode struct, move the nblocks field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
fb2fc172 |
|
23-Mar-2021 |
Gao Xiang <hsiangkao@redhat.com> |
xfs: support shrinking unused space in the last AG As the first step of shrinking, this attempts to enable shrinking unused space in the last allocation group by fixing up the freespace btree, agi, and agf and adjusting the super block, using a helper xfs_ag_shrink_space() to fix up the last AG. This can be all done in one transaction for now, so I think no additional protection is needed. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
5825bea0 |
|
22-Mar-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: __percpu_counter_compare() inode count debug too expensive - 21.92% __xfs_trans_commit - 21.62% xfs_log_commit_cil - 11.69% xfs_trans_unreserve_and_mod_sb - 11.58% __percpu_counter_compare - 11.45% __percpu_counter_sum - 10.29% _raw_spin_lock_irqsave - 10.28% do_raw_spin_lock __pv_queued_spin_lock_slowpath We debated just getting rid of it last time this came up and there was no real objection to removing it. Now it's the biggest scalability limitation for debug kernels even on smallish machines, so let's just get rid of it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
#
756b1c34 |
|
23-Feb-2021 |
Dave Chinner <dchinner@redhat.com> |
xfs: use current->journal_info for detecting transaction recursion Because the iomap code using PF_MEMALLOC_NOFS to detect transaction recursion in XFS is just wrong. Remove it from the iomap code and replace it with XFS specific internal checks using current->journal_info instead. [djwong: This change also realigns the lifetime of NOFS flag changes to match the incore transaction, instead of the inconsistent scheme we have now.] Fixes: 9070733b4efa ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
9febcda6 |
|
19-Feb-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: don't nest transactions when scanning for eofblocks Brian Foster reported a lockdep warning on xfs/167: ============================================ WARNING: possible recursive locking detected 5.11.0-rc4 #35 Tainted: G W I -------------------------------------------- fsstress/17733 is trying to acquire lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs] but task is already holding lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs] stack backtrace: CPU: 38 PID: 17733 Comm: fsstress Tainted: G W I 5.11.0-rc4 #35 Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018 Call Trace: dump_stack+0x8b/0xb0 __lock_acquire.cold+0x159/0x2ab lock_acquire+0x116/0x370 xfs_trans_alloc+0x1ad/0x310 [xfs] xfs_free_eofblocks+0x104/0x1d0 [xfs] xfs_blockgc_scan_inode+0x24/0x60 [xfs] xfs_inode_walk_ag+0x202/0x4b0 [xfs] xfs_inode_walk+0x66/0xc0 [xfs] xfs_trans_alloc+0x160/0x310 [xfs] xfs_trans_alloc_inode+0x5f/0x160 [xfs] xfs_alloc_file_space+0x105/0x300 [xfs] xfs_file_fallocate+0x270/0x460 [xfs] vfs_fallocate+0x14d/0x3d0 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The cause of this is the new code that spurs a scan to garbage collect speculative preallocations if we fail to reserve enough blocks while allocating a transaction. While the warning itself is a fairly benign lockdep complaint, it does expose a potential livelock if the rwsem behavior ever changes with regard to nesting read locks when someone's waiting for a write lock. Fix this by freeing the transaction and jumping back to xfs_trans_alloc like this patch in the V4 submission[1]. [1] https://lore.kernel.org/linux-xfs/161142798066.2171939.9311024588681972086.stgit@magnolia/ Fixes: a1a7d05a0576 ("xfs: flush speculative space allocations when we run out of space") Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
a1a7d05a |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: flush speculative space allocations when we run out of space If a fs modification (creation, file write, reflink, etc.) is unable to reserve enough space to handle the modification, try clearing whatever space the filesystem might have been hanging onto in the hopes of speeding up the filesystem. The flushing behavior will become particularly important when we add deferred inode inactivation because that will increase the amount of space that isn't actively tied to user data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
758303d1 |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: flush eof/cowblocks if we can't reserve quota for chown If a file user, group, or project change is unable to reserve enough quota to handle the modification, try clearing whatever space the filesystem might have been hanging onto in the hopes of speeding up the filesystem. The flushing behavior will become particularly important when we add deferred inode inactivation because that will increase the amount of space that isn't actively tied to user data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
c237dd7c |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: flush eof/cowblocks if we can't reserve quota for inode creation If an inode creation is unable to reserve enough quota to handle the modification, try clearing whatever space the filesystem might have been hanging onto in the hopes of speeding up the filesystem. The flushing behavior will become particularly important when we add deferred inode inactivation because that will increase the amount of space that isn't actively tied to user data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
766aabd5 |
|
22-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: flush eof/cowblocks if we can't reserve quota for file blocks If a fs modification (data write, reflink, xattr set, fallocate, etc.) is unable to reserve enough quota to handle the modification, try clearing whatever space the filesystem might have been hanging onto in the hopes of speeding up the filesystem. The flushing behavior will become particularly important when we add deferred inode inactivation because that will increase the amount of space that isn't actively tied to user data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
5c615f0f |
|
01-Feb-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: remove xfs_qm_vop_chown_reserve Now that the only caller of this function is xfs_trans_alloc_ichange, just open-code the meat of _chown_reserve in that caller. Drop the (redundant) [ugp]id checks because xfs has a 1:1 relationship between quota ids and incore dquots. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
7317a03d |
|
29-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: refactor inode ownership change transaction/inode/quota allocation idiom For file ownership (uid, gid, prid) changes, create a new helper xfs_trans_alloc_ichange that allocates a transaction and reserves the appropriate amount of quota against that transaction in preparation for a change of user, group, or project id. Replace all the open-coded idioms with a single call to this helper so that we can contain the retry loops in the next patchset. This changes the locking behavior for ichange transactions slightly. Since tr_ichange does not have a permanent reservation and cannot roll, we pass XFS_ILOCK_EXCL to ijoin so that the inode will be unlocked automatically at commit time. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
f2f7b9ff |
|
27-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: refactor inode creation transaction/inode/quota allocation idiom For file creation, create a new helper xfs_trans_alloc_icreate that allocates a transaction and reserves the appropriate amount of quota against that transaction. Replace all the open-coded idioms with a single call to this helper so that we can contain the retry loops in the next patchset. This changes the locking behavior for non-tempfile creation slightly, in that we now make the quota reservation without holding the directory ILOCK. While the dquots chosen for inode creation are based on the directory state at a given point in time, the directory ILOCK was released as soon as the dquot references are picked up. Hence it was never necessary to hold the directory ILOCK for the quota reservation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
3de4eb10 |
|
26-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: allow reservation of rtblocks with xfs_trans_alloc_inode Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode wrapper function, then convert a few more callsites. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
3a1af6c3 |
|
26-Jan-2021 |
Darrick J. Wong <djwong@kernel.org> |
xfs: refactor common transaction/inode/quota allocation idiom Create a new helper xfs_trans_alloc_inode that allocates a transaction, locks and joins an inode to it, and then reserves the appropriate amount of quota against that transaction. Then replace all the open-coded idioms with a single call to this helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
e8222613 |
|
16-Dec-2020 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove xfs_buf_t typedef Prepare for kernel xfs_buf alignment by getting rid of the xfs_buf_t typedef from userspace. [darrick: This patch is a port of a userspace patch removing the xfs_buf_t typedef in preparation to make the userspace xfs_buf code behave more like its kernel counterpart.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
d6b8fc6c |
|
23-Sep-2020 |
Kaixu Xia <kaixuxia@tencent.com> |
xfs: do the assert for all the log done items in xfs_trans_cancel We should do the assert for all the log intent-done items if they appear here. This patch detects intent-done items by the fact that their item ops don't have iop_unpin and iop_push methods and also moves the helper xlog_item_is_intent to xfs_trans.h. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
cead0b10 |
|
01-Sep-2020 |
Christoph Hellwig <hch@lst.de> |
xfs: simplify xfs_trans_getsb Remove the mp argument as this function is only called in transaction context, and open code xfs_getsb given that the function already accesses the buffer pointer in the mount point directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
32a2b11f |
|
22-Jul-2020 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: Remove kmem_zone_zalloc() usage Use kmem_cache_zalloc() directly. With the exception of xlog_ticket_alloc(), which will be dealt with in the next patch for readability. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
f74681ba |
|
29-Jun-2020 |
Brian Foster <bfoster@redhat.com> |
xfs: preserve rmapbt swapext block reservation from freed blocks The rmapbt extent swap algorithm remaps individual extents between the source inode and the target to trigger reverse mapping metadata updates. If either inode straddles a format or other bmap allocation boundary, the individual unmap and map cycles can trigger repeated bmap block allocations and frees as the extent count bounces back and forth across the boundary. While net block usage is bound across the swap operation, this behavior can prematurely exhaust the transaction block reservation because it continuously drains as the transaction rolls. Each allocation accounts against the reservation and each free returns to global free space on transaction roll. The previous workaround to this problem attempted to detect this boundary condition and provide surplus block reservation to accommodate it. This is insufficient because more remaps can occur than implied by the extent counts; if start offset boundaries are not aligned between the two inodes, for example. To address this problem more generically and dynamically, add a transaction accounting mode that returns freed blocks to the transaction reservation instead of the superblock counters on transaction roll and use it when the rmapbt based algorithm is active. This allows the chain of remap transactions to preserve the block reservation based on its own frees and prevents premature exhaustion regardless of the remap pattern. Note that this is only safe for superblocks with lazy sb accounting, but the latter is required for v5 supers and the rmap feature depends on v5. Fixes: b3fed434822d0 ("xfs: account format bouncing into rmapbt swapext tx reservation") Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
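The drain described in this message, and the effect of returning freed blocks to the reservation, can be illustrated with a toy model. This is a sketch under stated assumptions: model_tx, keep_freed and the helper names are hypothetical, and one "cycle" stands for a single bmap block allocation followed by a free and a transaction roll.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a rolling transaction's block reservation. */
struct model_tx {
	long blk_res;     /* blocks still reserved to the transaction chain */
	bool keep_freed;  /* hypothetical flag: return frees to blk_res on roll */
};

/* Take one block from the reservation; fail once it is exhausted. */
bool model_alloc_block(struct model_tx *tx)
{
	if (tx->blk_res == 0)
		return false;
	tx->blk_res--;
	return true;
}

/* Free one block and roll: the block goes back to the reservation or to global free space. */
void model_free_and_roll(struct model_tx *tx, long *global_free)
{
	assert(tx != 0 && global_free != 0);
	if (tx->keep_freed)
		tx->blk_res++;
	else
		(*global_free)++;
}

/* Bounce across the format boundary; returns how many cycles completed. */
long model_bounce(struct model_tx *tx, long *global_free, long cycles)
{
	long done = 0;

	while (done < cycles && model_alloc_block(tx)) {
		model_free_and_roll(tx, global_free);
		done++;
	}
	return done;
}
```

With a reservation of 2 blocks and 10 requested cycles, the model without keep_freed stops after 2 cycles even though net block usage is zero, while the model with keep_freed completes all 10 and ends with the reservation intact.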
|
#
b41b46c2 |
|
20-May-2020 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove the m_active_trans counter It's a global atomic counter, and we are hitting it at a rate of half a million transactions a second, so it's bouncing the counter cacheline all over the place on large machines. We don't actually need it anymore - it used to be required because the VFS freeze code could not track/prevent filesystem transactions that were running, but that problem no longer exists. Hence to remove the counter, we simply have to ensure that nothing calls xfs_sync_sb() while we are trying to quiesce the filesystem. That only happens if the log worker is still running when we call xfs_quiesce_attr(). The log worker is cancelled at the end of xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it early here and then we can remove the counter altogether. Concurrent create, 50 million inodes, identical 16p/16GB virtual machines on different physical hosts. Machine A has twice the CPU cores per socket of machine B: unpatched patched machine A: 3m16s 2m00s machine B: 4m04s 4m05s Create rates: unpatched patched machine A: 282k+/-31k 468k+/-21k machine B: 231k+/-8k 233k+/-11k Concurrent rm of same 50 million inodes: unpatched patched machine A: 6m42s 2m33s machine B: 4m47s 4m47s The transaction rate on the fast machine went from just under 300k/sec to 700k/sec, which indicates just how much of a bottleneck this atomic counter was. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
f18c9a90 |
|
20-May-2020 |
Dave Chinner <dchinner@redhat.com> |
xfs: reduce free inode accounting overhead Shaokun Zhang reported that XFS was using substantial CPU time in percpu_count_sum() when running a single threaded benchmark on a high CPU count (128p) machine from xfs_mod_ifree(). The issue is that the filesystem is empty when the benchmark runs, so inode allocation is running with a very low inode free count. With the percpu counter batching, this means comparisons when the counter is less than 128 * 256 = 32768 use the slow path of adding up all the counters across the CPUs, and this is expensive on high CPU count machines. The summing in xfs_mod_ifree() is only used to fire an assert if an underrun occurs. The error is ignored by the higher level code. Hence this is really just debug code and we don't need to run it on production kernels, nor do we need such debug checks to return error values just to trigger an assert. Finally, xfs_mod_icount/xfs_mod_ifree are only called from xfs_trans_unreserve_and_mod_sb(), so get rid of them and just directly call the percpu_counter_add/percpu_counter_compare functions. The compare functions are now run only on debug builds as they are internal to ASSERT() checks and so only compiled in when ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y). Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
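The 128 * 256 = 32768 threshold in this message can be made concrete with a simplified userspace model of a batched per-CPU counter comparison. This is a sketch, not the kernel implementation: the struct, the constants and the slow-path flag are hypothetical, and the model only assumes that a comparison falls back to summing every per-CPU slot when the approximate count is within batch * CPUs of the value being compared.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 128  /* CPU count from the report */
#define MODEL_BATCH   256  /* batch size used in the message's arithmetic */

/* Hypothetical model of a batched per-CPU counter. */
struct model_counter {
	int64_t count;                  /* cheap, approximate global value */
	int32_t percpu[MODEL_NR_CPUS];  /* per-CPU deltas not yet folded in */
};

/*
 * Compare the counter against rhs: returns -1, 0 or 1.
 * *slow is set to 1 when every per-CPU slot had to be summed.
 */
int model_compare(const struct model_counter *c, int64_t rhs, int *slow)
{
	int64_t count = c->count;
	int i;

	assert(slow != NULL);

	/* Fast path: the approximate value is far enough from rhs to be trusted. */
	if (llabs((long long)(count - rhs)) > (long long)MODEL_BATCH * MODEL_NR_CPUS) {
		*slow = 0;
		return count > rhs ? 1 : -1;
	}

	/* Slow path: walk all per-CPU slots to get a precise sum. */
	*slow = 1;
	for (i = 0; i < MODEL_NR_CPUS; i++)
		count += c->percpu[i];

	if (count > rhs)
		return 1;
	return count < rhs ? -1 : 0;
}
```

Comparing a counter holding 100 against 0 takes the slow path in this model, while a counter holding 40000 does not, which is why a nearly empty filesystem on a 128-CPU machine kept paying for the full summation.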
|
#
dc3ffbb1 |
|
20-May-2020 |
Dave Chinner <david@fromorbit.com> |
xfs: gut error handling in xfs_trans_unreserve_and_mod_sb() From: Dave Chinner <dchinner@redhat.com> The error handling in xfs_trans_unreserve_and_mod_sb() is largely incorrect - rolling back the changes in the transaction if only one counter underruns makes all the other counters incorrect. We still allow the change to proceed and commit the transaction, except now we have multiple incorrect counters instead of a single underflow. Further, we don't actually report the error to the caller, so this is completely silent except on debug kernels that will assert on failure before we even get to the rollback code. Hence this error handling is broken, untested, and largely unnecessary complexity. Just remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
8b41e3f9 |
|
25-Mar-2020 |
Christoph Hellwig <hch@lst.de> |
xfs: split xlog_ticket_done Remove xlog_ticket_done and just call the renamed low-level helpers for ungranting or regranting log space directly. To make that a little easier, the reference put on the ticket and all tracing are moved into the actual helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
dd401770 |
|
25-Mar-2020 |
Dave Chinner <dchinner@redhat.com> |
xfs: refactor and split xfs_log_done() xfs_log_done() does two separate things. Firstly, it triggers commit records to be written for permanent transactions, and secondly it releases or regrants transaction reservation space. Since delayed logging was introduced, transactions no longer write directly to the log, hence they never have the XLOG_TIC_INITED flag cleared on them. Hence transactions never write commit records to the log and only need to modify reservation space. Split up xfs_log_done into two parts, and only call the parts of the operation needed for the context xfs_log_done() is currently being called from. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
27fb5a72 |
|
25-Mar-2020 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: prohibit fs freezing when using empty transactions I noticed that fsfreeze can take a very long time to freeze an XFS if there happens to be a GETFSMAP caller running in the background. I also happened to notice the following in dmesg: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs] CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs] Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de <0f> 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54 RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000 R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8 FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0 Call Trace: xfs_fs_freeze+0x25/0x40 [xfs] freeze_super+0xc8/0x180 do_vfs_ioctl+0x70b/0x750 ? __fget_files+0x135/0x210 ksys_ioctl+0x3a/0xb0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe These two things appear to be related. 
The assertion trips when another thread initiates a fsmap request (which uses an empty transaction) after the freezer waited for m_active_trans to hit zero but before the freezer executes the WARN_ON just prior to calling xfs_log_quiesce. The lengthy delays in freezing happen because the freezer calls xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the GETFSMAP caller is continuing to grab and release buffers, which means that it can take a very long time for the buffer lru list to empty out. We fix both of these races by calling sb_start_write to obtain freeze protection while using empty transactions for GETFSMAP and for metadata scrubbing. The other two users occur during mount, during which time we cannot fs freeze. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
3e6e8afd |
|
10-Mar-2020 |
Christoph Hellwig <hch@lst.de> |
xfs: remove XFS_BUF_TO_SBP Just dereference bp->b_addr directly and make the code a little simpler and more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
377bcd5f |
|
14-Nov-2019 |
Carlos Maiolino <cmaiolino@redhat.com> |
xfs: Remove kmem_zone_free() wrapper We can remove it now, without needing to rework the KM_ flags. Use kmem_cache_free() directly. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
707e0dda |
|
26-Aug-2019 |
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> |
fs: xfs: Remove KM_NOSLEEP and KM_SLEEP. Since no caller is using KM_NOSLEEP and no callee branches on KM_SLEEP, we can remove KM_NOSLEEP and replace KM_SLEEP with 0. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
73d30d48 |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: remove XFS_TRANS_NOFS Instead of a magic flag for xfs_trans_alloc, just ensure all callers that can't reclaim through the file system use memalloc_nofs_save to set the per-task nofs flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
250d4b4c |
|
28-Jun-2019 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: remove unused header files There are many, many xfs header files which are included but unneeded (or included twice) in the xfs code, so remove them. nb: xfs_linux.h includes about 9 headers for everyone, so those explicit includes get removed by this. I'm not sure what the preference is, but if we wanted explicit includes everywhere, a followup patch could remove those xfs_*.h includes from xfs_linux.h and move them into the files that need them. Or it could be left as-is. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
9ce632a2 |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: add a flag to release log items on commit We have various items that are released from ->iop_committing. Add a flag to just call ->iop_release from the commit path to avoid tons of boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
ddf92053 |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: split iop_unlock The iop_unlock method is called when committing or cancelling a transaction. In the latter case, the transaction may or may not be aborted. While there is no known problem with the current code in practice, this implementation is limited in that any log item implementation that might want to differentiate between a commit and a cancellation must rely on the aborted state. The aborted bit is only set when the cancelled transaction is dirty, however. This means that there is no way to distinguish between a commit and a clean transaction cancellation. For example, intent log items currently rely on this distinction. The log item is either transferred to the CIL on commit or released on transaction cancel. There is currently no possibility for a clean intent log item in a transaction, but if that state is ever introduced a cancel of such a transaction will immediately result in memory leaks of the associated log item(s). This is an interface deficiency and landmine. To clean this up, replace the iop_unlock method with an iop_release method that is specific to transaction cancel. The existing iop_committing method occurs at the same time as iop_unlock in the commit path and there is no need for two separate callbacks here. Overload the iop_committing method with the current commit time iop_unlock implementations to eliminate the need for the latter and further simplify the interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
195cd83d |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: don't use xfs_trans_free_items in the commit path While committing items looks very similar to freeing them on error, it is a different operation, and they will diverge a bit soon. Split out the commit case from xfs_trans_free_items, inline it into xfs_log_commit_cil and give it a separate trace point. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
e8b78db7 |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: don't require log items to implement optional methods Just check if they are present first. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
d15cbf2f |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: stop using XFS_LI_ABORTED as a parameter flag Just pass a straight bool aborted instead of abusing XFS_LI_ABORTED as a flag in function parameters. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
086252c3 |
|
28-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
xfs: fix a trivial comment typo in xfs_trans_committed_bulk Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
8c9ce2f7 |
|
12-Jun-2019 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: remove unused flags arg from getsb interfaces The flags value is always passed as 0 so remove the argument. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
8683edb7 |
|
28-Sep-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: avoid lockdep false positives in xfs_trans_alloc We've had a few reports of lockdep tripping over memory reclaim context vs filesystem freeze "deadlocks". They all have looked to be false positives on analysis, but it seems that they are being tripped because we take freeze references before we run a GFP_KERNEL allocation for the struct xfs_trans. We can avoid this false positive vector just by re-ordering the operations in xfs_trans_alloc(). That is, we need to allocate the structure before we take the freeze reference and enter the GFP_NOFS allocation context that follows the xfs_trans around. This prevents lockdep from seeing the GFP_KERNEL allocation inside the transaction context, and that prevents it from triggering the freeze level vs alloc context vs reclaim warnings. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
9d9e6233 |
|
01-Aug-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: fold dfops into the transaction struct xfs_defer_ops has now been reduced to a single list_head. The external dfops mechanism is unused and thus everywhere a (permanent) transaction is accessible the associated dfops structure is as well. Remove the xfs_defer_ops structure and fold the list_head into the transaction. Also remove the last remnant of external dfops in xfs_trans_dup(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
1ae093cb |
|
01-Aug-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: replace xfs_defer_ops ->dop_pending with on-stack list The xfs_defer_ops ->dop_pending list is used to track active deferred operations once intents are logged. These items must be aborted in the event of an error. The list is populated as intents are logged and items are removed as they complete (or are aborted). Now that xfs_defer_finish() cancels on error, there is no need to ever access ->dop_pending outside of xfs_defer_finish(). The list is only ever populated after xfs_defer_finish() begins and is either completed or cancelled before it returns. Remove ->dop_pending from xfs_defer_ops and replace it with a local list in the xfs_defer_finish() path. Pass the local list to the various helpers now that it is not accessible via dfops. Note that we have to check for NULL in the abort case as the final tx roll occurs outside of the scope of the new local list (once the dfops has completed and thus drained the list). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
9b1f4e98 |
|
01-Aug-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: cancel dfops on xfs_defer_finish() error The current semantics of xfs_defer_finish() require the caller to call xfs_defer_cancel() on error. This is slightly inconsistent with transaction commit error handling where a failed commit cleans up the transaction before returning. More significantly, the only requirement for exposure of ->dop_pending outside of xfs_defer_finish() is so that xfs_defer_cancel() can drain it on error. Since the only recourse of xfs_defer_finish() errors is cancellation, mirror the transaction logic and cancel remaining dfops before returning from xfs_defer_finish() with an error. Beside simplifying xfs_defer_finish() semantics, this ensures that xfs_defer_finish() always returns with an empty ->dop_pending and thus facilitates removal of the list from xfs_defer_ops. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
60f31a60 |
|
01-Aug-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: clean out superfluous dfops dop params/vars The dfops code still passes around the xfs_defer_ops pointer superfluously in a few places. Clean this up wherever the transaction will suffice. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
ce356d64 |
|
01-Aug-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: pass transaction to dfops reset/move helpers All callers pass ->t_dfops of the associated transactions. Refactor the helpers to receive the transactions and facilitate further cleanups between xfs_defer_ops and xfs_trans. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
7279aa13 |
|
01-Aug-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: remove unused __xfs_defer_cancel() internal helper With no more external dfops users, there is no need for an xfs_defer_ops cancel wrapper. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
98719051 |
|
01-Aug-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: refactor internal dfops initialization The current transaction allocation code conditionally initializes the ->t_dfops indirection pointer. Transaction commit/cancel check the validity of the pointer to determine whether to finish/cancel the internal dfops. This disallows the ability to use the internal dfops list as a temporary container (via xfs_trans_alloc_empty()). Refactor transaction allocation to always initialize ->t_dfops and check permanent reservation state on transaction commit/cancel. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
b277c37f |
|
24-Jul-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: bypass final dfops roll in trans commit path Once xfs_defer_finish() has completed all deferred operations, it checks the dirty state of the transaction and rolls it once more to return a clean transaction for the caller. This is primarily to cover the case where repeated xfs_defer_finish() calls are made in a loop and we need to make sure that the caller starts the next iteration with a clean transaction. Otherwise we risk transaction reservation overrun. This final transaction roll is not required in the transaction commit path, however, because the transaction is immediately committed and freed after dfops completion. Refactor the final roll into a separate helper such that we can avoid it in the transaction commit path. Lift the dfops reset as well so dfops remains valid until after the last call to xfs_defer_trans_roll(). The reset is also unnecessary in the transaction commit path because the transaction is about to complete. This eliminates unnecessary regrants of transactions where the associated transaction roll can be replaced by a transaction commit. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
9e28a242 |
|
24-Jul-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: drop unnecessary xfs_defer_finish() dfops parameter Every caller of xfs_defer_finish() now passes the transaction and its associated ->t_dfops. The xfs_defer_ops parameter is therefore no longer necessary and can be removed. Since most xfs_defer_finish() callers also have to consider xfs_defer_cancel() on error, update the latter to also receive the transaction for consistency. The log recovery code contains an outlier case that cancels a dfops directly without an available transaction. Retain an internal wrapper to support this outlier case for the time being. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
e021a2e5 |
|
24-Jul-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: support embedded dfops in transaction The dfops structure used by multi-transaction operations is typically stored on the stack and carried around by the associated transaction. The lifecycle of dfops does not quite match that of the transaction, but they are tightly related in that the former depends on the latter. The relationship of these objects is tight enough that we can avoid the cumbersome boilerplate code required in most cases to manage them separately by just embedding an xfs_defer_ops in the transaction itself. This means that a transaction allocation returns with an initialized dfops, a transaction commit finishes pending deferred items before the tx commit, a transaction cancel cancels the dfops before the transaction and a transaction dup operation transfers the current dfops state to the new transaction. The dup operation is slightly complicated by the fact that we can no longer just copy a dfops pointer from the old transaction to the new transaction. This is solved through a dfops move helper that transfers the pending items and other dfops state across the transactions. This also requires that transaction rolling code always refer to the transaction for the current dfops reference. Finally, to facilitate incremental conversion to the internal dfops and continue to support the current external dfops mode of operation, create the new ->t_dfops_internal field with a layer of indirection. On allocation, ->t_dfops points to the internal dfops. This state is overridden by callers who re-init a local dfops on the transaction. Once ->t_dfops is overridden, the external dfops reference is maintained as the transaction rolls. This patch adds the fundamental ability to support an internal dfops. All codepaths that perform deferred processing continue to override the internal dfops until they are converted over in subsequent patches. 
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
bba59c5e |
|
11-Jul-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: add firstblock field to xfs_trans A firstblock var is typically allocated and initialized along with xfs_defer_ops structures and passed around independent from the associated transaction. To facilitate combining the two, add an optional ->t_firstblock field to xfs_trans that can be used in place of an on-stack variable. The firstblock value follows the lifetime of the transaction, so initialize it on allocation and when a transaction rolls. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
6aa67184 |
|
11-Jul-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: rename xfs_trans ->t_agfl_dfops to ->t_dfops The ->t_agfl_dfops field is currently used to defer agfl block frees from associated transaction contexts. While all known problematic contexts have already been updated to use ->t_agfl_dfops, the broader goal is to defer agfl frees from all callers that already use a deferred operations structure. Further, the transaction field facilitates a good amount of code clean up where the transaction and dfops have historically been passed down through the stack separately. Rename the field to something more generic to prepare to use it as such throughout XFS. This patch does not change behavior. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
10ee2526 |
|
22-Jun-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: allow empty transactions while frozen In commit e89c041338ed6ef ("xfs: implement the GETFSMAP ioctl") we created the ability to obtain empty transactions. These transactions have no log or block reservations and therefore can't modify anything. Since they're also NO_WRITECOUNT they can run while the fs is frozen, so we don't need to WARN_ON about that usage. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
0b61f8a4 |
|
05-Jun-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert to SPDX license tags Remove the verbose license text from XFS files and replace them with SPDX tags. This does not change the license of any of the code, merely refers to the common, up-to-date license files in LICENSES/ This change was mostly scripted. fs/xfs/Makefile and fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected and modified by the following command:

    for f in `git grep -l "GNU General" fs/xfs/` ; do
        echo $f
        cat $f | awk -f hdr.awk > $f.new
        mv -f $f.new $f
    done

And the hdr.awk script that did the modification (including detecting the difference between GPL-2.0 and GPL-2.0+ licenses) is as follows:

    $ cat hdr.awk
    BEGIN {
        hdr = 1.0
        tag = "GPL-2.0"
        str = ""
    }

    /^ \* This program is free software/ {
        hdr = 2.0;
        next
    }

    /any later version./ {
        tag = "GPL-2.0+"
        next
    }

    /^ \*\// {
        if (hdr > 0.0) {
            print "// SPDX-License-Identifier: " tag
            print str
            print $0
            str=""
            hdr = 0.0
            next
        }
        print $0
        next
    }

    /^ \* / {
        if (hdr > 1.0)
            next
        if (hdr > 0.0) {
            if (str != "")
                str = str "\n"
            str = str $0
            next
        }
        print $0
        next
    }

    /^ \*/ {
        if (hdr > 0.0)
            next
        print $0
        next
    }

    // {
        if (hdr > 0.0) {
            if (str != "")
                str = str "\n"
            str = str $0
            next
        }
        print $0
    }

    END { }
    $

Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
e6631f85 |
|
09-May-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: get rid of the log item descriptor It's just a connector between a transaction and a log item. There's a 1:1 relationship between a log item descriptor and a log item, and a 1:1 relationship between a log item descriptor and a transaction. Both relationships are created and terminated at the same time, so why do we even have the descriptor? Replace it with a specific list_head in the log item and a new log item dirtied flag to replace the XFS_LID_DIRTY flag. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [darrick: fix up deferred agfl intent finish_item use of LID_DIRTY] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
ba18781b |
|
09-May-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: add tracing to high level transaction operations Because currently we have no idea what the transaction context we are operating in is, and I need to know that information to track down bugs in multiple log item joins to transactions. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
22525c17 |
|
09-May-2018 |
Dave Chinner <dchinner@redhat.com> |
xfs: log item flags are racy The log item flags contain a field that is protected by the AIL lock - the XFS_LI_IN_AIL flag. We use non-atomic RMW operations to set and clear these flags, but most of the updates and checks are not done with the AIL lock held and so are susceptible to update races. Fix this by changing the log item flags to use atomic bitops rather than be reliant on the AIL lock for update serialisation. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
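The race described above comes from plain read-modify-write updates on a shared flags word. As a minimal userspace sketch of the fix, assuming C11 `<stdatomic.h>` in place of the kernel's bitops, the flag updates can be made atomic as follows. The `log_item` struct, the `LI_*` bit values, and the `li_*` helper names are hypothetical and are not the XFS definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real XFS_LI_* values may differ. */
#define LI_IN_AIL	(1UL << 0)
#define LI_ABORTED	(1UL << 1)

/* Hypothetical stand-in for a log item carrying an atomic flags word. */
struct log_item {
	atomic_ulong flags;
};

/* Atomically set a flag; safe without holding any lock. */
static void li_set(struct log_item *li, unsigned long bit)
{
	atomic_fetch_or(&li->flags, bit);
}

/* Atomically clear a flag. */
static void li_clear(struct log_item *li, unsigned long bit)
{
	atomic_fetch_and(&li->flags, ~bit);
}

/* Read the current state of a flag. */
static bool li_test(struct log_item *li, unsigned long bit)
{
	return (atomic_load(&li->flags) & bit) != 0;
}

/* Clear a flag and report whether it was set beforehand. */
static bool li_test_and_clear(struct log_item *li, unsigned long bit)
{
	return (atomic_fetch_and(&li->flags, ~bit) & bit) != 0;
}
```

Because each update is a single atomic operation, two contexts changing different bits of the same word cannot overwrite each other's change, which is the failure mode of the non-atomic version.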
|
#
f8f2835a |
|
07-May-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: defer agfl block frees when dfops is available The AGFL fixup code executes before every block allocation/free and rectifies the AGFL based on the current, dynamic allocation requirements of the fs. The AGFL must hold a minimum number of blocks to satisfy a worst case split of the free space btrees caused by the impending allocation operation. The AGFL is also updated to maintain the implicit requirement for a minimum number of free slots to satisfy a worst case join of the free space btrees. Since the AGFL caches individual blocks, AGFL reduction typically involves multiple, single block frees. We've had reports of transaction overrun problems during certain workloads that boil down to AGFL reduction freeing multiple blocks and consuming more space in the log than was reserved for the transaction. Since the objective of freeing AGFL blocks is to ensure free AGFL slots are available for the upcoming allocation, one way to address this problem is to release surplus blocks from the AGFL immediately but defer the free of those blocks (similar to how file-mapped blocks are unmapped from the file in one transaction and freed via a deferred operation) until the transaction is rolled. This turns AGFL reduction into an operation with predictable log reservation consumption. Add the capability to defer AGFL block frees when a deferred ops list is available to the AGFL fixup code. Add a dfops pointer to the transaction to carry dfops through various contexts to the allocator context. Deferring AGFL frees is conditional behavior based on whether the transaction pointer is populated. The long term objective is to reuse the transaction pointer to clean up all unrelated callchains that pass dfops on the stack along with a transaction and in doing so, consistently defer AGFL blocks from the allocator. 
A bit of customization is required to handle deferred completion processing because AGFL blocks are accounted against a per-ag reservation pool and AGFL blocks are not inserted into the extent busy list when freed (they are inserted when used and released back to the AGFL). Reuse the majority of the existing deferred extent free infrastructure and customize it appropriately to handle AGFL blocks. Note that this patch only adds infrastructure. It does not change behavior because no callers have been updated to pass ->t_agfl_dfops into the allocation code. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
656de4ff |
|
14-Mar-2018 |
Christoph Hellwig <hch@lst.de> |
xfs: merge _xfs_log_force_lsn and xfs_log_force_lsn Switch to a single interface for flushing the log to a specific LSN, which gives consistent trace point coverage and a less confusing interface. There was only a single user of the previous xfs_log_force_lsn function, which now also passes a NULL log_flushed argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
3e78b9a4 |
|
09-Mar-2018 |
Brian Foster <bfoster@redhat.com> |
xfs: shutdown if block allocation overruns tx reservation The ->t_blk_res_used field tracks how many blocks have been used in the current transaction. This should never exceed the block reservation (->t_blk_res) for a particular transaction. We currently assert this condition in the transaction block accounting code, but otherwise take no additional action should this situation occur. The overrun generally has no effect if space ends up being available and the associated transaction commits. If the transaction is duplicated, however, the current block usage is used to determine the remaining block reservation to be transferred to the new transaction. If usage exceeds reservation, this calculation underflows and creates a transaction with an invalid and excessive reservation. When the second transaction commits, the release of unused blocks corrupts the in-core free space counters. With lazy superblock accounting enabled, this inconsistency eventually trickles to the on-disk superblock and corrupts the filesystem. Replace the transaction block usage accounting assert with an explicit overrun check. If the transaction overruns the reservation, shutdown the filesystem immediately to prevent corruption. Add a new assert to xfs_trans_dup() to catch any callers that might induce this invalid state in the future. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
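The underflow described above is ordinary unsigned wraparound. The following is a minimal userspace sketch, not the kernel code: `struct blk_res` and both helpers are hypothetical stand-ins for the ->t_blk_res and ->t_blk_res_used accounting, and the `false` return marks the point where the commit says the filesystem is shut down.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two block counters in a transaction. */
struct blk_res {
	uint32_t res;		/* blocks reserved, like ->t_blk_res */
	uint32_t res_used;	/* blocks consumed, like ->t_blk_res_used */
};

/*
 * Account for nblocks more used blocks. Returns false on overrun; in
 * this sketch that is where a caller would shut the filesystem down
 * instead of carrying on with an inconsistent transaction.
 */
static bool blk_res_use(struct blk_res *r, uint32_t nblocks)
{
	r->res_used += nblocks;
	return r->res_used <= r->res;
}

/*
 * Remaining reservation as a naive duplication step might compute it.
 * If res_used exceeds res, the unsigned subtraction wraps around to a
 * huge value rather than going negative.
 */
static uint32_t blk_res_remaining_unchecked(const struct blk_res *r)
{
	return r->res - r->res_used;
}
```

With a reservation of 10 blocks and 13 used, the unchecked helper yields 4294967293 rather than -3, which corresponds to the "invalid and excessive reservation" handed to the duplicated transaction.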
|
#
57e80956 |
|
07-Mar-2018 |
Matthew Wilcox <willy@infradead.org> |
xfs: Rename xa_ elements to ail_ This is a simple rename, except that xa_ail becomes ail_head. Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
b872af2c |
|
08-Jan-2018 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: trace log reservations at mount time At each mount, emit the transaction reservation type information via tracepoints. This makes it easier to compare the log reservation info calculated by the kernel and xfsprogs so that we can more easily diagnose minimum log size failures on freshly formatted filesystems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
411350df |
|
28-Aug-2017 |
Christoph Hellwig <hch@lst.de> |
xfs: refactor xfs_trans_roll Split xfs_trans_roll into a low-level helper that just rolls the actual transaction and a new higher level xfs_trans_roll_inode that takes care of logging and rejoining the inode. This gets rid of the NULL inode case, and allows simplifying the special cases in the deferred operation code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
9070733b |
|
03-May-2017 |
Michal Hocko <mhocko@suse.com> |
xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite some time ago. We would like to make this concept more generic and use it for other filesystems as well. Let's start by giving the flag a more generic name PF_MEMALLOC_NOFS which is in line with an existing PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO contexts. Replace all PF_FSTRANS usage from the xfs code in the first step before we introduce a full API for it as xfs uses the flag directly anyway. This patch doesn't introduce any functional change. Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
254133f5 |
|
06-Apr-2017 |
Christoph Hellwig <hch@lst.de> |
xfs: fold __xfs_trans_roll into xfs_trans_roll No one cares about the low-level helper anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
#
e89c0413 |
|
28-Mar-2017 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: implement the GETFSMAP ioctl Introduce a new ioctl that uses the reverse mapping btree to return information about the physical layout of the filesystem. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
|
#
3fd129b6 |
|
18-Sep-2016 |
Darrick J. Wong <darrick.wong@oracle.com> |
xfs: set up per-AG free space reservations One unfortunate quirk of the reference count and reverse mapping btrees -- they can expand in size when blocks are written to *other* allocation groups if, say, one large extent becomes a lot of tiny extents. Since we don't want to start throwing errors in the middle of CoWing, we need to reserve some blocks to handle future expansion. The transaction block reservation counters aren't sufficient here because we have to have a reserve of blocks in every AG, not just somewhere in the filesystem. Therefore, create two per-AG block reservation pools. One feeds the AGFL so that rmapbt expansion always succeeds, and the other feeds all other metadata so that refcountbt expansion never fails. Use the count of how many reserved blocks we need to have on hand to create a virtual reservation in the AG. Through selective clamping of the maximum length of allocation requests and of the length of the longest free extent, we can make it look like there's less free space in the AG unless the reservation owner is asking for blocks. In other words, play some accounting tricks in-core to make sure that we always have blocks available. On the plus side, there's nothing to clean up if we crash, which is in contrast to the strategy that the rough draft used (actually removing extents from the freespace btrees). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
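The "virtual reservation" idea above can be illustrated with a small sketch. This is an assumption-laden model, not the XFS implementation: `struct ag`, `ag_visible_free()`, and `ag_clamp_request()` are hypothetical names, and a single reserved count stands in for the two pools the commit describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-core view of one allocation group. */
struct ag {
	uint32_t free_blocks;	/* blocks actually free in the AG */
	uint32_t reserved;	/* blocks held back for btree expansion */
};

/*
 * Free space as seen by a caller. Ordinary callers see the free count
 * minus the reservation; the reservation owner sees everything.
 */
static uint32_t ag_visible_free(const struct ag *ag, bool is_owner)
{
	if (is_owner)
		return ag->free_blocks;
	if (ag->free_blocks <= ag->reserved)
		return 0;
	return ag->free_blocks - ag->reserved;
}

/* Clamp an allocation request to what the caller is allowed to see. */
static uint32_t ag_clamp_request(const struct ag *ag, uint32_t want,
				 bool is_owner)
{
	uint32_t avail = ag_visible_free(ag, is_owner);

	return want < avail ? want : avail;
}
```

Nothing is removed from the free space structures in this model; only the answer handed to non-owners shrinks, which is why there is nothing to undo after a crash.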
|
#
a27f6ef4 |
|
13-Sep-2016 |
Eryu Guan <eguan@redhat.com> |
xfs: undo block reservation correctly in xfs_trans_reserve() "blocks" should be added back to fdblocks at undo time, not taken away, i.e. the minus sign should not be used. This is a regression introduced by commit 0d485ada404b ("xfs: use generic percpu counters for free block counter"). It was found by code inspection; I didn't hit it in the real world, so there's no reproducer. Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
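The sign error is easy to see in a reduced form. The following sketch assumes a plain signed counter in place of the per-cpu fdblocks counter; `struct counters`, `mod_fdblocks()`, and `reserve()` are hypothetical names, and the `later_step_fails` parameter simulates a failure after the block reservation has been taken.

```c
#include <stdint.h>

/* Hypothetical free-block counter standing in for fdblocks. */
struct counters {
	int64_t fdblocks;
};

/* Apply a signed delta; refuse to let the counter go negative. */
static int mod_fdblocks(struct counters *c, int64_t delta)
{
	if (c->fdblocks + delta < 0)
		return -1;	/* out of space */
	c->fdblocks += delta;
	return 0;
}

/*
 * Take a block reservation, then optionally fail a later step.
 * Reserving uses a negative delta; undoing it must use a positive
 * delta so the blocks are given back rather than taken a second time.
 */
static int reserve(struct counters *c, uint32_t blocks, int later_step_fails)
{
	if (mod_fdblocks(c, -(int64_t)blocks))
		return -1;
	if (later_step_fails) {
		mod_fdblocks(c, (int64_t)blocks);	/* undo: add back */
		return -2;
	}
	return 0;
}
```

After a failed `reserve()` the counter is back at its previous value; with the minus sign left on the undo call it would instead drop by twice the requested amount.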
|
#
710b1e2c |
|
05-Apr-2016 |
Christoph Hellwig <hch@lst.de> |
xfs: remove transaction types These aren't used for CIL-style logging and can be dropped. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
253f4911 |
|
05-Apr-2016 |
Christoph Hellwig <hch@lst.de> |
xfs: better xfs_trans_alloc interface Merge xfs_trans_reserve and xfs_trans_alloc into a single function call that returns a transaction with all the required log and block reservations, and which allows passing transaction flags directly to avoid the cumbersome _xfs_trans_alloc interface. While we're at it we also get rid of the transaction type argument that has been superfluous since we stopped supporting the non-CIL logging mode. The guts of it will be removed in another patch. [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
cc07eed8 |
|
14-Mar-2016 |
Eric Sandeen <sandeen@redhat.com> |
xfs: ensure committed is initialized in xfs_trans_roll __xfs_trans_roll() can return without setting the *committed argument; this was a problem for xfs_bmap_finish():

    int committed;/* xact committed or not */
    ...
    error = __xfs_trans_roll(tp, ip, &committed);
    if (error) {
    ...
        if (committed) {

and we tested an uninitialized "committed" variable on the error path. No caller is preserving "committed" state across calls to __xfs_trans_roll(), so just initialize committed inside the function to avoid future errors like this. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
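A reduced sketch of the pattern follows, assuming a hypothetical `roll()` helper whose two boolean parameters simulate a failure before or after the commit point. It is not the kernel function; it only shows why writing the out-parameter on entry makes every return path safe for the caller.

```c
#include <stdbool.h>

/*
 * Hypothetical roll helper. The out-parameter is written first, so a
 * caller may inspect *committed after any return, including early
 * error returns.
 */
static int roll(bool fail_before_commit, bool fail_after_commit,
		int *committed)
{
	*committed = 0;		/* defined on every path from here on */

	if (fail_before_commit)
		return -1;	/* nothing was committed */

	*committed = 1;		/* the commit step succeeded */

	if (fail_after_commit)
		return -2;	/* error, but the commit did happen */

	return 0;
}
```

A caller can then branch on `committed` in its error path without first having to set the variable itself.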
|
#
ff6d6af2 |
|
12-Oct-2015 |
Bill O'Donnell <billodo@redhat.com> |
xfs: per-filesystem stats counter implementation This patch modifies the stats counting macros and the callers to those macros to properly increment, decrement, and add-to the xfs stats counts. The counts for global and per-fs stats are correctly advanced, and cleared by writing a "1" to the corresponding clear file.

    global counts: /sys/fs/xfs/stats/stats
    per-fs counts: /sys/fs/xfs/sda*/stats/stats
    global clear: /sys/fs/xfs/stats/stats_clear
    per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear

[dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around stats structures and macros. ] Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
d43ac29b |
|
18-Aug-2015 |
Brian Foster <bfoster@redhat.com> |
xfs: return committed status from xfs_trans_roll() Some callers need to make error handling decisions based on whether the current transaction successfully committed or not. Rename xfs_trans_roll(), add a new parameter and provide a wrapper to preserve existing callers. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
f78c3901 |
|
03-Jun-2015 |
Christoph Hellwig <hch@lst.de> |
xfs: fix xfs_log_done interface Instead of the confusing flags argument pass a boolean flag to indicate if we want to release or regrant a log reservation. Also ensure that xfs_log_done always drops the reference on the log ticket, to both simplify the code and make the logic in xfs_trans_roll easier to understand. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
70393313 |
|
03-Jun-2015 |
Christoph Hellwig <hch@lst.de> |
xfs: saner xfs_trans_commit interface The flags argument to xfs_trans_commit is not useful for most callers, as a commit of a transaction without a permanent log reservation must pass 0 here, and all callers for a transaction with a permanent log reservation except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove the flags argument from the public xfs_trans_commit interfaces, and introduce low-level __xfs_trans_commit variant just for xfs_trans_roll that regrants a log reservation instead of releasing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
4906e215 |
|
03-Jun-2015 |
Christoph Hellwig <hch@lst.de> |
xfs: remove the flags argument to xfs_trans_cancel xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and XFS_TRANS_ABORT. Both of them are a direct product of the transaction state, and can be deduced:
 - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled, and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
 - any transaction with a permanent log reservation needs XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent log reservation is invalid.
So just remove the flags argument and do the right thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
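The two rules above amount to deriving the cancel behaviour from state the transaction already carries. As a minimal sketch with hypothetical names (`struct trans`, the `TRANS_*` bits, and `trans_cancel_actions()` are illustrative only, not the XFS definitions):

```c
#include <stdbool.h>

/* Hypothetical transaction state bits. */
#define TRANS_DIRTY		(1u << 0)
#define TRANS_PERM_LOG_RES	(1u << 1)

/* Hypothetical stand-in for a transaction. */
struct trans {
	unsigned int flags;
};

/* What a cancel has to do, derived purely from transaction state. */
struct cancel_actions {
	bool abort;		/* dirty transactions must be aborted */
	bool release_log_res;	/* permanent reservations must be released */
};

static struct cancel_actions trans_cancel_actions(const struct trans *tp)
{
	struct cancel_actions a = {
		.abort = (tp->flags & TRANS_DIRTY) != 0,
		.release_log_res = (tp->flags & TRANS_PERM_LOG_RES) != 0,
	};

	return a;
}
```

Since both answers come from the transaction itself, a caller has no way to pass a combination that contradicts the transaction's state.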
|
#
eacb24e7 |
|
03-Jun-2015 |
Christoph Hellwig <hch@lst.de> |
xfs: pass a boolean flag to xfs_trans_free_items The flags value always was 0 or XFS_TRANS_ABORT. Switch to a bool parameter to allow further cleanups. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
2e6db6c4 |
|
03-Jun-2015 |
Christoph Hellwig <hch@lst.de> |
xfs: switch remaining xfs_trans_dup users to xfs_trans_roll We have three remaining callers of xfs_trans_dup:
 - xfs_itruncate_extents which open codes xfs_trans_roll
 - xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves attaching them to its callers, but otherwise is identical to xfs_trans_roll
 - xfs_dir_ialloc looks at the log reservations in the old xfs_trans structure instead of the log reservation parameters, but otherwise is identical to xfs_trans_roll.
By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch these three remaining users over to xfs_trans_roll and mark xfs_trans_dup static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
0bd5dded |
|
23-Feb-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: replace xfs_mod_incore_sb_batched Introduce helper functions for modifying fields in the superblock into xfs_trans.c, the only caller of xfs_mod_incore_sb_batch(). We can then use these directly in xfs_trans_unreserve_and_mod_sb() and so remove another user of the xfs_mod_incore_sb() API without losing any functionality or scalability of the transaction commit code. Based on a patch from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
bab98bbe |
|
23-Feb-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce xfs_mod_frextents Add a new helper to modify the incore counter of free realtime extents. This matches the helpers used for inode and data block counters, and removes a significant user of the xfs_mod_incore_sb() interface. Based on a patch originally from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
0d485ada |
|
23-Feb-2015 |
Dave Chinner <david@fromorbit.com> |
xfs: use generic percpu counters for free block counter XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. The free block counter is special in that it is used for ENOSPC detection outside transaction contexts for delayed allocation. This means that the counter needs to be accurate at zero. The current per-cpu counter code jumps through lots of hoops to ensure we never run past zero, but we don't need to make all those jumps with the generic counter implementation. The generic counter implementation allows us to pass a "batch" threshold at which the addition/subtraction to the counter value will be folded back into global value under lock. We can use this feature to reduce the batch size as we approach 0 in a very similar manner to the existing counters and their rebalance algorithm. If we use a batch size of 1 as we approach 0, then every addition and subtraction will be done against the global value and hence allow accurate detection of zero threshold crossing. Hence we can replace the handrolled, accurate-at-zero counters with generic percpu counters. Note: this removes just enough of the icsb infrastructure to compile without warnings. The rest will go in subsequent commits. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
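The batch-size idea in the free block counter message above can be illustrated with a toy, single-CPU model. This is a rough sketch, not the kernel's percpu_counter implementation: `struct toy_counter`, `toy_add()`, `toy_pick_batch()` and the `TOY_BATCH` value are all hypothetical, chosen only to show how a batch of 1 forces every update into the global value.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_BATCH 1024  /* hypothetical default batch size */

struct toy_counter {
	long long global;  /* value folded back under the (imaginary) lock */
	long long local;   /* this CPU's delta that has not been folded yet */
};

/* Accumulate locally; fold into the global value once |local| reaches batch. */
static void toy_add(struct toy_counter *c, long long delta, long long batch)
{
	assert(batch >= 1);
	c->local += delta;
	if (llabs(c->local) >= batch) {
		c->global += c->local;
		c->local = 0;
	}
}

/* Shrink the batch to 1 when the counter is close to zero. */
static long long toy_pick_batch(const struct toy_counter *c)
{
	return c->global < TOY_BATCH ? 1 : TOY_BATCH;
}
```

With a batch of 1 the local delta is always zero after an update, so comparing the global value against zero is exact; far from zero, small updates stay local and avoid the shared value.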
#
e88b64ea |
|
23-Feb-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: use generic percpu counters for free inode counter XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. The free inode counter is not used for any limit enforcement - the per-AG free inode counters are used during allocation to determine if there are inodes available for allocation. Hence we don't need any of the complexity of the hand-rolled counters and we can simply replace them with generic per-cpu counters similar to the inode counter. This version introduces a xfs_mod_ifree() helper function from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
501ab323 |
|
23-Feb-2015 |
Dave Chinner <david@fromorbit.com> |
xfs: use generic percpu counters for inode counter XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. There are some warts around the use of them for the inode counter as the hand rolled counter is designed to be accurate at zero, but has no specific accuracy at any other value. This design causes problems for the maximum inode count threshold enforcement, as there is no trigger that balances the counters as they get close to the maximum threshold. Instead of designing new triggers for balancing, just replace the handrolled per-cpu counter with a generic counter. This enables us to update the counter through the normal superblock modification functions, but rather than do that we add a xfs_mod_icount() helper function (from Christoph Hellwig) and keep the percpu counter outside the superblock in the struct xfs_mount. This means we still need to initialise the per-cpu counter specifically when we read the superblock, and vice versa when we log/write it, but it does mean that we don't need to change any other code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
3443a3bc |
|
21-Jan-2015 |
Dave Chinner <dchinner@redhat.com> |
xfs: set superblock buffer type correctly When the superblock is modified in a transaction, the commonly modified fields are not actually copied to the superblock buffer to avoid the buffer lock becoming a serialisation point. However, there are some other operations that modify the superblock fields within the transaction that don't directly log to the superblock but rely on the changes to be applied during the transaction commit (to minimise the buffer lock hold time). When we do this, we fail to mark the buffer log item as being a superblock buffer and that can lead to the buffer not being marked with the correct type in the log and hence causing recovery issues. Fix it by setting the type correctly, similar to xfs_mod_sb()... cc: <stable@vger.kernel.org> # 3.10 to current Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
bb58e618 |
|
27-Nov-2014 |
Christoph Hellwig <hch@lst.de> |
xfs: move most of xfs_sb.h to xfs_format.h More on-disk format consolidation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
4fb6e8ad |
|
27-Nov-2014 |
Christoph Hellwig <hch@lst.de> |
xfs: merge xfs_ag.h into xfs_format.h More on-disk format consolidation. A few declarations that weren't on-disk format related move into more suitable spots. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
2451337d |
|
24-Jun-2014 |
Dave Chinner <dchinner@redhat.com> |
xfs: global error sign conversion Convert all the errors in the core XFS code to negative error signs like the rest of the kernel and remove all the sign conversion we do in the interface layers. Errors for conversion (and comparison) found via searches like: $ git grep " E" fs/xfs $ git grep "return E" fs/xfs $ git grep " E[A-Z].*;$" fs/xfs Negation points found via searches like: $ git grep "= -[a-z,A-Z]" fs/xfs $ git grep "return -[a-z,A-D,F-Z]" fs/xfs $ git grep " -[a-z].*;" fs/xfs [ with some bits I missed from Brian Foster ] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
b474c7ae |
|
21-Jun-2014 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: Nuke XFS_ERROR macro XFS_ERROR was designed long ago to trap return values, but it's not runtime configurable, it's not consistently used, and we can do similar error trapping with ftrace scripts and triggers from userspace. Just nuke XFS_ERROR and associated bits. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
d99831ff |
|
21-Jun-2014 |
Eric Sandeen <sandeen@sandeen.net> |
xfs: return is not a function return is not a function. "return(EIO);" is silly; "return (EIO);" moreso. return is not a function. Nuke the pointless parens. [dchinner: catch a couple of extra cases in xfs_attr_list.c, xfs_acl.c and xfs_linux.h.] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
e4a1e29c |
|
14-Apr-2014 |
Eric Sandeen <sandeen@redhat.com> |
xfs: remove unused ail pointer arg from xfs_trans_ail_cursor_done() Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
c6f97264 |
|
06-Feb-2014 |
Jie Liu <jeff.liu@oracle.com> |
xfs: convert xfs_log_commit_cil() to void Convert xfs_log_commit_cil() to a void function since it returns nothing but 0 in any case; after that we can simplify the related code logic in xfs_trans_commit() accordingly. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
a4fbe6ab |
|
22-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: decouple inode and bmap btree header files Currently the xfs_inode.h header has a dependency on the definition of the BMAP btree records as the inode fork includes an array of xfs_bmbt_rec_host_t objects in its definition. Move all the btree format definitions from xfs_btree.h, xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to xfs_format.h to continue the process of centralising the on-disk format definitions. With this done, the xfs inode definitions are no longer dependent on btree header files. This enables a massive culling of unnecessary includes, with close to 200 #include directives removed from the XFS kernel code base. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
239880ef |
|
22-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: decouple log and transaction headers xfs_trans.h has a dependency on xfs_log.h for a couple of structures. Most code that does transactions doesn't need to know anything about the log, but this dependency means that they have to include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header files and clean up the includes to be in dependency order. In doing this, remove the direct include of xfs_trans_reserve.h from xfs_trans.h so that we remove the dependency between xfs_trans.h and xfs_mount.h. Hence the xfs_trans.h include can be moved to indicate the actual dependencies other header files have on it. Note that these are kernel only header files, so this does not translate to any userspace changes at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
57062787 |
|
14-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: unify directory/attribute format definitions The on-disk format definitions for the directory and attribute structures are spread across 3 header files right now, only one of which is dedicated to defining on-disk structures and their manipulation (xfs_dir2_format.h). Pull all the format definitions into a single header file - xfs_da_format.h - and switch all the code over to point at that. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
70a9883c |
|
22-Oct-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: create a shared header file for format-related information All of the buffer operations structures are needed to be exported for xfs_db, so move them all to a common location rather than spreading them all over the place. They are verifying the on-disk format, so while xfs_format.h might be a good place, it is not part of the on disk format. Hence we need to create a new header file in which we centralise these related definitions. Start by moving the buffer operations structures, and then also move all the other definitions that have crept into xfs_log_format.h and xfs_format.h as there was no other shared header file to put them in. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
904c17e6 |
|
28-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: finish removing IOP_* macros. In optimising the CIL operations, some of the IOP_* macros for calling log item operations were removed. Remove the rest of them as Christoph requested. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
3d3c8b52 |
|
12-Aug-2013 |
Jie Liu <jeff.liu@oracle.com> |
xfs: refactor xfs_trans_reserve() interface Now that the new xfs_trans_res structure has been introduced, the log reservation size, log count as well as log flags are pre-initialized at mount time. So it's time to refine the xfs_trans_reserve() interface to be neater. Also, introduce a new helper M_RES() to return a pointer to the mp->m_resv structure to simplify the input. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
0eadd102 |
|
12-Aug-2013 |
Jie Liu <jeff.liu@oracle.com> |
xfs: Introduce a new structure to hold transaction reservation items Introduce a new structure xfs_trans_res to hold transaction reservation item info per log ticket. We also need to improve xfs_trans_resv_calc() by initializing the log count as well as log flags for permanent log reservation. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
7fd36c44 |
|
12-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: split out transaction reservation code The transaction reservation size calculations are used by both kernel and userspace, but most of the transaction code in xfs_trans.c is kernel specific. Split all the transaction reservation code out into its own files to make sharing with userspace simpler. This just leaves kernel-only definitions in xfs_trans.h, so it doesn't need to be shared with userspace anymore, either. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
2a3c0acc |
|
12-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: split out on-disk transaction definitions There's a bunch of definitions in xfs_trans.h that define on-disk formats - transaction headers that get written into the log, log item type definitions, etc. Split out everything into a separate file so that all which remains in xfs_trans.h are kernel only definitions. Also, remove the duplicate magic number definitions for XFS_TRANS_MAGIC... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
6ca1c906 |
|
12-Aug-2013 |
Dave Chinner <dchinner@redhat.com> |
xfs: separate dquot on disk format definitions out of xfs_quota.h The on disk format definitions of the on-disk dquot, log formats and quota off log formats are all intertwined with other definitions for quotas. Separate them out into their own header file so they can easily be shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
b8402b47 |
|
27-Jun-2013 |
Dave Chinner <david@fromorbit.com> |
xfs: Inode create transaction reservations Define the log and space transaction sizes. Factor the current create log reservation macro into the two logical halves and reuse one half for the new icreate transactions. The icreate transaction is transparent to all the high level create code - the pre-calculated reservations will correctly set the reservations dependent on whether the filesystem supports the icreate transaction. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
a21cd503 |
|
28-Jan-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: refactor space log reservation for XFS_TRANS_ATTR_SET Currently, we calculate the attribute set transaction log space reservation at runtime in two parts: 1) XFS_ATTRSET_LOG_RES() which is calculated at mount time. 2) ((ext * (mp)->m_sb.sb_sectsize) + \ (ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \ (128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))) which is calculated at runtime since it depends on the given extent length in blocks. This patch renames XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp), which is used to calculate the unit of the log space reservation for one block. In this way, the total runtime space for the given extent length can be figured out by: XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
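The factoring behind XFS_ATTRSETRT_LOG_RES(mp) in the message above is plain arithmetic: every term of the runtime expression is a multiple of ext, so a per-block unit can be computed once and multiplied later. The sketch below checks that under assumptions; `struct toy_geom` and both functions are hypothetical, and the shift by `blocklog` is assumed to stand in for XFS_FSB_TO_B().

```c
#include <assert.h>

/* Hypothetical geometry values standing in for fields reached through mp. */
struct toy_geom {
	unsigned int sectsize;   /* sector size in bytes */
	unsigned int blocklog;   /* log2 of the filesystem block size */
	unsigned int maxlevels;  /* maximum bmap btree height, attr fork */
};

/* The runtime part as written in the message, evaluated per call. */
static unsigned long toy_attrset_runtime(const struct toy_geom *g,
					 unsigned long ext)
{
	assert(g);
	unsigned long fsb_bytes = (unsigned long)g->maxlevels << g->blocklog;

	return ext * g->sectsize + ext * fsb_bytes +
	       128 * (ext + ext * g->maxlevels);
}

/* The same expression with ext factored out: the cost of one block. */
static unsigned long toy_attrsetrt_unit(const struct toy_geom *g)
{
	assert(g);
	unsigned long fsb_bytes = (unsigned long)g->maxlevels << g->blocklog;

	return g->sectsize + fsb_bytes + 128UL * (1 + g->maxlevels);
}
```

With a 512-byte sector, 4096-byte blocks and a maximum height of 5, the unit works out to 512 + 20480 + 768 = 21760 bytes per block, and multiplying by ext gives the same total as the unfactored form.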
#
a7bd794a |
|
28-Jan-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk Introduce a new transaction space reservation XFS_SB_LOG_RES() for those transactions that need to modify the superblock on disk. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
762d7ba6 |
|
28-Jan-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time Convert the calculation for end of quotaoff log space reservation from runtime to mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
a1bd9557 |
|
28-Jan-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time Convert the calculation of quota off transaction log space reservation from runtime to mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
48001044 |
|
28-Jan-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time The disk quota allocation log space reservation is calculated at runtime; this patch does it at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
f0f2df94 |
|
28-Jan-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: calculate XFS_TRANS_QM_SETQLIM space log reservation at mount time For transactions that adjust quota limits, we calculate the log space reservation at runtime; this patch does it at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
b0c10b98 |
|
28-Jan-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time The transaction log space for clearing/resetting the quota flags is calculated at runtime; this patch figures it out at mount time. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
5b292ae3 |
|
01-Feb-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: make use of xfs_calc_buf_res() in xfs_trans.c Refining the existing reservations with xfs_calc_buf_res() in xfs_trans.c Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
4f3b5783 |
|
28-Jan-2013 |
Jeff Liu <jeff.liu@oracle.com> |
xfs: add a helper to figure out the space log reservation per item Add a new helper xfs_calc_buf_res() to calculate the transaction space reservations per item. xfs_buf_log_overhead() is used to figure out the extra space for struct xfs_buf_log_format that gets written into the log for every buffer as well as a log opheader, i.e. struct xlog_op_header. Signed-off-by: Jie Liu <jeff.liu@oracle.com> CC: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
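The shape of a per-item reservation helper like the one described above can be sketched as follows. This is an assumption-laden illustration, not the kernel function: `toy_calc_buf_res()` is a hypothetical name, and the per-buffer overhead is passed in as a parameter rather than derived from the real structure sizes.

```c
#include <assert.h>

/*
 * Hypothetical helper: every logged buffer costs its own size plus a fixed
 * per-buffer overhead (format header and op header), so the reservation
 * for nbufs buffers of the same size is a single multiplication.
 */
static unsigned int toy_calc_buf_res(unsigned int nbufs, unsigned int size,
				     unsigned int overhead)
{
	assert(size > 0);
	return nbufs * (size + overhead);
}
```

For example, two 4096-byte buffers with an assumed 128-byte overhead each would reserve 2 * (4096 + 128) = 8448 bytes.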
#
d9457dc0 |
|
12-Jun-2012 |
Jan Kara <jack@suse.cz> |
xfs: Convert to new freezing code Generic code now blocks all writers from standard write paths. So we add blocking of all writers coming from ioctl (we get a protection of ioctl against racing remount read-only as a bonus) and convert xfs_file_aio_write() to a non-racy freeze protection. We also keep freeze protection on transaction start to block internal filesystem writes such as removal of preallocated blocks. CC: Ben Myers <bpm@sgi.com> CC: Alex Elder <elder@kernel.org> CC: xfs@oss.sgi.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
77ba7877 |
|
02-Apr-2012 |
Al Viro <viro@zeniv.linux.org.uk> |
xfs: switch to proper __bitwise type for KM_... flags Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
ad1e95c5 |
|
22-Apr-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: clean up xfs_bit.h includes With the removal of xfs_rw.h and other changes over time, xfs_bit.h is being included in many files that don't actually need it. Clean up the includes as necessary. Also move the only-used-once xfs_ialloc_find_free() static inline function out of a header file that is widely included to reduce the number of needless dependencies on xfs_bit.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
4ecbfe63 |
|
29-Apr-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: clean up busy extent naming Now that the busy extent tracking has been moved out of the allocation files, clean up the namespace it uses to "xfs_extent_busy" rather than a mix of "xfs_busy" and "xfs_alloc_busy". Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
efc27b52 |
|
29-Apr-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: move busy extent handling to its own file To make it easier to handle userspace code merges, move all the busy extent handling out of the allocation code and into its own file. The userspace code does not need the busy extent code, so this simplifies the merging of the kernel code into the userspace xfsprogs library. Because the busy extent code has been almost completely rewritten over the past couple of years, also update the copyright on this new file to include the authors that made all those changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
60a34607 |
|
22-Apr-2012 |
Dave Chinner <dchinner@redhat.com> |
xfs: move xfsagino_t to xfs_types.h Untangle the header file includes a bit by moving the definition of xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include xfs_ag.h. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
9006fb91 |
|
19-Feb-2012 |
Christoph Hellwig <hch@infradead.org> |
xfs: split and cleanup xfs_log_reserve Split the log regrant case out of xfs_log_reserve into a separate function, and merge xlog_grant_log_space and xlog_regrant_write_log_space into their respective callers. Also replace the XFS_LOG_PERM_RESERV flag, which easily got misused before the previous cleanups with a simple boolean parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
f65020a8 |
|
13-Feb-2012 |
Jesper Juhl <jj@chaosbits.net> |
XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended It looks to me like the two ASSERT()s in xfs_trans_add_item() really want to do a compare (==) rather than assignment (=). This patch changes it from the latter to the former. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 05293485a0b6b1f803e8a3c0ff188c38f6969985)
|
#
05293485 |
|
13-Feb-2012 |
Jesper Juhl <jj@chaosbits.net> |
XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended It looks to me like the two ASSERT()s in xfs_trans_add_item() really want to do a compare (==) rather than assignment (=). This patch changes it from the latter to the former. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Ben Myers <bpm@sgi.com>
|
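The bug class fixed by the xfs_trans_add_item() commit above is easy to reproduce in isolation. The sketch below is a generic illustration with hypothetical function names, not the XFS code: it contrasts a check that assigns with one that compares.

```c
#include <assert.h>

/*
 * Buggy form: '=' stores 1 into *flags and the expression is then always
 * true, so the check can never fail and it silently changes the value.
 */
static int toy_check_assign(int *flags)
{
	return (*flags = 1) != 0;
}

/* Intended form: '==' only inspects the value. */
static int toy_check_compare(const int *flags)
{
	return *flags == 1;
}
```

A further hazard, assuming the assertion macro is compiled out in non-debug builds, is that the accidental assignment then disappears as well, so debug and production builds would behave differently.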
#
b3934213 |
|
06-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove the lid_size field in struct log_item_desc Outside the now removed nodelaylog code this field is only used for asserts and can be safely removed now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
0244b960 |
|
06-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: cleanup the transaction commit path a bit Now that the nodelaylog mode is gone we can simplify the transaction commit path a bit by removing the xfs_trans_commit_cil routine. Restoring the process flags is merged into xfs_trans_commit which already does it for the error path, and allocating the log vectors is merged into xlog_cil_format_items, which already fills them with data, thus avoiding one loop over all log items. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
93b8a585 |
|
06-Dec-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove the deprecated nodelaylog option The delaylog mode has been the default for a long time, and the nodelaylog option has been scheduled for removal in Linux 3.3. Remove it and code only used by it now that we have opened the 3.3 window. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
|
#
ddc3415a |
|
19-Sep-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: simplify xfs_trans_ijoin* again There is no reason to keep a reference to the inode even if we unlock it during transaction commit because we never drop a reference between the ijoin and commit. Also use this fact to merge xfs_trans_ijoin_ref back into xfs_trans_ijoin - the third argument decides if an unlock is needed now. I'm actually starting to wonder if allowing inodes to be unlocked at transaction commit really is worth the effort. The only real benefit is that they can be unlocked earlier when committing synchronous transactions, but that could be solved by doing the log force manually after the unlock, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
b1037058 |
|
19-Sep-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: unlock the inode before log force in xfs_fsync Only read the LSN we need to push to with the ilock held, and then release it before we do the log force to improve concurrency. This also removes the only direct caller of _xfs_trans_commit, thus allowing it to be merged into the plain xfs_trans_commit again. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
1d8c95a3 |
|
17-Jul-2011 |
Dave Chinner <dchinner@redhat.com> |
xfs: use a cursor for bulk AIL insertion Delayed logging can insert tens of thousands of log items into the AIL at the same LSN. When the committing of log commit records occurs, we can get insertions occurring at an LSN that is not at the end of the AIL. If there are thousands of items in the AIL on the tail LSN, each insertion has to walk the AIL to find the correct place to insert the new item into the AIL. This can consume large amounts of CPU time and block other operations from occurring while the traversals are in progress. To avoid this repeated walk, use an AIL cursor to record where we should be inserting the new items into the AIL without having to repeat the walk. The cursor infrastructure already provides this functionality for push walks, so is a simple extension of existing code. While this will not avoid the initial walk, it will avoid repeating it tens of thousands of times during a single checkpoint commit. This version includes logic improvements from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
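The saving described in the AIL cursor message above can be shown with a toy sorted list. This is a simplified sketch under stated assumptions: the list is singly linked and walked from the head, which is not how the real AIL is organised, and `struct toy_item`, `toy_bulk_insert()` and `toy_demo()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct toy_item {
	long lsn;
	struct toy_item *next;
};

/* Advance from start to the last node with lsn <= target, counting steps. */
static struct toy_item *toy_find_prev(struct toy_item *start, long lsn,
				      long *steps)
{
	struct toy_item *pos = start;

	while (pos->next && pos->next->lsn <= lsn) {
		pos = pos->next;
		(*steps)++;
	}
	return pos;
}

/* Insert n items at one LSN; with a cursor, resume from the last insert. */
static long toy_bulk_insert(struct toy_item *head, struct toy_item *items,
			    size_t n, long lsn, int use_cursor)
{
	struct toy_item *cursor = head;
	long steps = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_item *prev =
			toy_find_prev(use_cursor ? cursor : head, lsn, &steps);

		items[i].lsn = lsn;
		items[i].next = prev->next;
		prev->next = &items[i];
		cursor = &items[i];  /* remember where the walk ended */
	}
	return steps;
}

/* Build head -> 1 -> 2 -> 9, insert four items at LSN 5, return steps. */
static long toy_demo(int use_cursor)
{
	struct toy_item old[3] = { { 1, NULL }, { 2, NULL }, { 9, NULL } };
	struct toy_item head = { 0, &old[0] };
	struct toy_item fresh[4];

	old[0].next = &old[1];
	old[1].next = &old[2];
	return toy_bulk_insert(&head, fresh, 4, 5, use_cursor);
}
```

Without the cursor every insertion repeats the walk and also steps over the items just inserted; with it, only the first insertion pays for the walk.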
#
b2ce3974 |
|
11-Jul-2011 |
Alex Elder <aelder@sgi.com> |
Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc" This reverts commit 7a249cf83da1813cfa71cfe1e265b40045eceb47. That commit created a situation that could lead to a filesystem hang. As Dave Chinner pointed out, xfs_trans_alloc() could hold a reference to m_active_trans (i.e., keep it non-zero) and then wait for SB_FREEZE_TRANS to complete. Meanwhile a filesystem freeze request could set SB_FREEZE_TRANS and then wait for m_active_trans to drop to zero. Nobody benefits from this sequence of events... Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
7a249cf8 |
|
08-Jul-2011 |
Christoph Hellwig <hch@lst.de> |
xfs: fix filesystsem freeze race in xfs_trans_alloc As pointed out by Jan xfs_trans_alloc can race with a concurrent filesystem freeze when it sleeps during the memory allocation. Fix this by moving the wait_for_freeze call after the memory allocation. This means moving the freeze into the low-level _xfs_trans_alloc helper, which thus grows a new argument. Also fix up some comments in that area while at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <david@fromorbit.com>
|
#
1316d4da |
|
03-Jul-2011 |
Dave Chinner <dchinner@redhat.com> |
xfs: unpin stale inodes directly in IOP_COMMITTED When inodes are marked stale in a transaction, they are treated specially when the inode log item is being inserted into the AIL. It tries to avoid moving the log item forward in the AIL due to a race condition with the writing of the underlying buffer back to disk. This was "fixed" in commit de25c18 ("xfs: avoid moving stale inodes in the AIL"). To avoid moving the item forward, we return a LSN smaller than the commit_lsn of the completing transaction, thereby trying to trick the commit code into not moving the inode forward at all. I'm not sure this ever worked as intended - it assumes the inode is already in the AIL, but I don't think the returned LSN would have been small enough to prevent moving the inode. It appears that the reason it worked is that the lower LSN of the inodes meant they were inserted into the AIL and flushed before the inode buffer (which was moved to the commit_lsn of the transaction). The big problem is that with delayed logging, the returning of the different LSN means insertion takes the slow, non-bulk path. Worse yet is that insertion is to a position -before- the commit_lsn so it is doing an AIL traversal on every insertion, and has to walk over all the items that have already been inserted into the AIL. It's expensive. To compound the matter further, with delayed logging inodes are likely to go from clean to stale in a single checkpoint, which means they aren't even in the AIL at all when we come across them at AIL insertion time. Hence these were all getting inserted into the AIL when they simply do not need to be as inodes marked XFS_ISTALE are never written back. Transactional/recovery integrity is maintained in this case by the other items in the unlink transaction that were modified (e.g. the AGI btree blocks) and committed in the same checkpoint. 
So to fix this, simply unpin the stale inodes directly in xfs_inode_item_committed() and return -1 to indicate that the AIL insertion code does not need to do any further processing of these inodes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
e84661aa |
|
20-May-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: add online discard support Now that we have reliable tracking of deleted extents in a transaction we can easily implement "online" discard support which calls blkdev_issue_discard once a transaction commits. The actual discard is a two stage operation as we first have to mark the busy extent as not available for reuse before we can start the actual discard. Note that we don't bother supporting discard for the non-delaylog mode. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
8a072a4d |
|
24-Apr-2011 |
Christoph Hellwig <hch@infradead.org> |
xfs: reduce the number of pagb_lock roundtrips in xfs_alloc_clear_busy Instead of finding the per-ag and then taking and releasing the pagb_lock for every single busy extent completed, sort the list of busy extents and only switch between AGs where necessary. This becomes especially important with the online discard support which will hit this lock more often. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
c6f990d1 |
|
26-Jan-2011 |
Dave Chinner <dchinner@redhat.com> |
xfs: handle CIL transaction commit failures correctly Failure to commit a transaction into the CIL is not handled correctly. This currently can only happen when racing with a shutdown and requires an explicit shutdown check, so it is rare and can be avoided. Remove the shutdown check and make the CIL commit a void function to indicate it will always succeed, thereby removing the incorrectly handled failure case. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
|
#
e34a314c |
|
26-Jan-2011 |
Dave Chinner <dchinner@redhat.com> |
xfs: fix efi item leak on forced shutdown After test 139, kmemleak shows: unreferenced object 0xffff880078b405d8 (size 400): comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s) hex dump (first 32 bytes): 60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff `..y....`..y.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60 [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0 [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0 [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50 [<ffffffff8147cd6b>] xfs_efi_init+0x4b/0xb0 [<ffffffff814a4ee8>] xfs_trans_get_efi+0x58/0x90 [<ffffffff81455fab>] xfs_bmap_finish+0x8b/0x1d0 [<ffffffff814851b4>] xfs_itruncate_finish+0x2c4/0x5d0 [<ffffffff814a970f>] xfs_setattr+0x8df/0xa70 [<ffffffff814b5c7b>] xfs_vn_setattr+0x1b/0x20 [<ffffffff8117dc00>] notify_change+0x170/0x2e0 [<ffffffff81163bf6>] do_truncate+0x66/0xa0 [<ffffffff81163d0b>] sys_ftruncate+0xdb/0xe0 [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff The cause of the leak is that the "remove" parameter of IOP_UNPIN() is never set when a CIL push is aborted. This means that the EFI item is never freed if it was in the push being cancelled. The problem is specific to delayed logging, but has uncovered a couple of problems with the handling of IOP_UNPIN(remove). Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN() in the CIL commit failure path or the iclog write failure path because for delayed logging we have no transaction context. Hence we must only call xfs_trans_del_item() if the log item being unpinned has an active log item descriptor. Secondly, xfs_trans_uncommit() does not handle log item descriptor freeing during the traversal of log items on a transaction. It can reference a freed log item descriptor when unpinning an EFI item. Hence it needs to use a safe list traversal method to allow items to be removed from the transaction during IOP_UNPIN().
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
|
#
1884bd83 |
|
25-Dec-2010 |
Jesper Juhl <jj@chaosbits.net> |
xfs: fix an assignment within an ASSERT() In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out: label we have this: ASSERT(error = 0); I believe a comparison was intended, not an assignment. If I'm right, the patch below fixes that up. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Alex Elder <aelder@sgi.com>
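The difference between the buggy and the fixed assertion can be shown with a minimal sketch. The CHECK macro and the two helper functions below are hypothetical stand-ins invented for illustration; they are not the kernel's ASSERT() and not the code from the patch.

```c
/* Hypothetical stand-in for an assertion macro: evaluates to 1 when the
 * asserted expression is true and 0 when it is false. */
#define CHECK(expr) ((expr) ? 1 : 0)

/* Buggy form: "error = 0" is an assignment. The expression always
 * evaluates to 0 (false) and the caller's error value is overwritten. */
static int check_buggy(int *error)
{
	return CHECK(*error = 0);
}

/* Fixed form: "error == 0" is a comparison. The result depends on the
 * value and the value itself is left untouched. */
static int check_fixed(const int *error)
{
	return CHECK(*error == 0);
}
```

With an initial error of 5, check_buggy() reports a failed check and also resets the variable to 0, while check_fixed() reports the failure and leaves the 5 in place.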
|
#
0e57f6a3 |
|
19-Dec-2010 |
Dave Chinner <dchinner@redhat.com> |
xfs: bulk AIL insertion during transaction commit When inserting items into the AIL from the transaction committed callbacks, we take the AIL lock for every single item that is to be inserted. For a CIL checkpoint commit, this can be tens of thousands of individual inserts, yet almost all of the items will be inserted at the same point in the AIL because they have the same index. To reduce the overhead and contention on the AIL lock for such operations, introduce a "bulk insert" operation which allows a list of log items with the same LSN to be inserted in a single operation via a list splice. To do this, we need to pre-sort the log items being committed into a temporary list for insertion. The complexity is that not every log item will end up with the same LSN, and not every item is actually inserted into the AIL. Items that don't match the commit LSN will be inserted and unpinned as per the current one-at-a-time method (relatively rare), while items that are not to be inserted will be unpinned and freed immediately. Items that are to be inserted at the given commit lsn are placed in a temporary array and inserted into the AIL in bulk each time the array fills up. As a result of this, we trade off AIL hold time for a significant reduction in traffic. lock_stat output shows that the worst case hold time is unchanged, but contention from AIL inserts drops by an order of magnitude and the number of lock traversals decreases significantly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
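The three-way split described above (skip, one-at-a-time, batched) can be sketched in plain C. This is only an illustration under assumptions: struct item, struct stats, BATCH_SIZE, flush_batch() and commit_items() are hypothetical names, and the counters stand in for the real AIL lock and list splice.

```c
#include <stddef.h>

#define BATCH_SIZE 4	/* hypothetical; kept small for illustration */

struct item {
	long lsn;	/* LSN the item wants to be inserted at */
	int  skip;	/* non-zero: item is not inserted at all */
};

struct stats {
	int bulk_flushes;	/* "lock taken once per batch" events */
	int bulk_items;		/* items inserted via a batch */
	int single_inserts;	/* items inserted one at a time */
	int skipped;		/* items unpinned without insertion */
};

/* Stand-in for the bulk insert: one lock round trip covers the batch. */
static void flush_batch(struct stats *st, const struct item **batch, int *n)
{
	if (*n == 0)
		return;
	st->bulk_flushes++;
	for (int i = 0; i < *n; i++)
		if (batch[i] != NULL)
			st->bulk_items++;
	*n = 0;
}

void commit_items(const struct item *items, size_t count, long commit_lsn,
		  struct stats *st)
{
	const struct item *batch[BATCH_SIZE];
	int n = 0;

	for (size_t i = 0; i < count; i++) {
		if (items[i].skip) {		/* never inserted */
			st->skipped++;
			continue;
		}
		if (items[i].lsn != commit_lsn) { /* rare slow path */
			st->single_inserts++;
			continue;
		}
		batch[n++] = &items[i];		/* common case: batch it */
		if (n == BATCH_SIZE)
			flush_batch(st, batch, &n);
	}
	flush_batch(st, batch, &n);		/* insert the remainder */
}
```

The design point the sketch tries to capture is that the number of lock round trips scales with the number of batches rather than the number of items, at the cost of holding the lock slightly longer per round trip.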
|
#
4957a449 |
|
06-Oct-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: fix the xfs_trans_committed Use the correct prototype for xfs_trans_committed instead of casting it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
dfe188d4 |
|
06-Oct-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove unused t_callback field in struct xfs_trans Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
1b040712 |
|
29-Sep-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb and remove support for per-cpu counters from xfs_mod_incore_sb_batch to simplify it. An added benefit is that we don't have to take m_sb_lock for transactions that only modify per-cpu counters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
96540c78 |
|
29-Sep-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: do not use xfs_mod_incore_sb for per-cpu counters Export xfs_icsb_modify_counters and always use it for modifying the per-cpu counters. Remove support for per-cpu counters from xfs_mod_incore_sb to simplify it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
d17c701c |
|
23-Aug-2010 |
Dave Chinner <dchinner@redhat.com> |
xfs: unlock items before allowing the CIL to commit When we commit a transaction using delayed logging, we need to unlock the items in the transaction before we unlock the CIL context and allow it to be checkpointed. If we unlock them after we release the CIL context lock, the CIL can checkpoint and complete before we free the log items. This breaks stale buffer item unlock and unpin processing as there is an implicit assumption that the unlock will occur before the unpin. Also, some log items need to store the LSN of the transaction commit in the item (inodes and EFIs) and so can race with other transaction completions if we don't prevent the CIL from checkpointing before the unlock occurs. Cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
43869706 |
|
20-Jul-2010 |
Dave Chinner <dchinner@redhat.com> |
xfs: fix xfs_trans_add_item() lockdep warnings xfs_trans_add_item() is called with ip->i_ilock held, which means it is unsafe for memory reclaim to recurse back into the filesystem (ilock is required in writeback). Hence the allocation needs to be KM_NOFS to avoid recursion. Lockdep report indicating memory allocation being called with the ip->i_ilock held is as follows: [ 1749.866796] ================================= [ 1749.867788] [ INFO: inconsistent lock state ] [ 1749.868327] 2.6.35-rc3-dgc+ #25 [ 1749.868741] --------------------------------- [ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. [ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 1749.868741] (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] {IN-RECLAIM_FS-W} state was registered at: [ 1749.868741] [<ffffffff810b3a97>] __lock_acquire+0x437/0x1450 [ 1749.868741] [<ffffffff810b4b56>] lock_acquire+0xa6/0x160 [ 1749.868741] [<ffffffff810a20b5>] down_write_nested+0x65/0xb0 [ 1749.868741] [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] [<ffffffff8134e819>] xfs_reclaim_inode+0x99/0x310 [ 1749.868741] [<ffffffff8134f56b>] xfs_inode_ag_walk+0x8b/0x150 [ 1749.868741] [<ffffffff8134f6bb>] xfs_inode_ag_iterator+0x8b/0xf0 [ 1749.868741] [<ffffffff8134f7a8>] xfs_reclaim_inode_shrink+0x88/0x90 [ 1749.868741] [<ffffffff81119d07>] shrink_slab+0x137/0x1a0 [ 1749.868741] [<ffffffff8111bbe1>] balance_pgdat+0x421/0x6a0 [ 1749.868741] [<ffffffff8111bf7d>] kswapd+0x11d/0x320 [ 1749.868741] [<ffffffff8109ce56>] kthread+0x96/0xa0 [ 1749.868741] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10 [ 1749.868741] irq event stamp: 4234335 [ 1749.868741] hardirqs last enabled at (4234335): [<ffffffff81147d25>] kmem_cache_free+0x115/0x220 [ 1749.868741] hardirqs last disabled at (4234334): [<ffffffff81147c4d>] kmem_cache_free+0x3d/0x220 [ 1749.868741] softirqs last enabled at (4233112): [<ffffffff81084dd2>] __do_softirq+0x142/0x260 [ 
1749.868741] softirqs last disabled at (4233095): [<ffffffff81035edc>] call_softirq+0x1c/0x50 [ 1749.868741] [ 1749.868741] other info that might help us debug this: [ 1749.868741] 2 locks held by dd/2835: [ 1749.868741] #0: (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [<ffffffff81316edd>] xfs_ilock_nowait+0xed/0x200 [ 1749.868741] #1: (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] [ 1749.868741] stack backtrace: [ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25 [ 1749.868741] Call Trace: [ 1749.868741] [<ffffffff810b1faa>] print_usage_bug+0x18a/0x190 [ 1749.868741] [<ffffffff8104264f>] ? save_stack_trace+0x2f/0x50 [ 1749.868741] [<ffffffff810b2400>] ? check_usage_backwards+0x0/0xf0 [ 1749.868741] [<ffffffff810b2f11>] mark_lock+0x331/0x400 [ 1749.868741] [<ffffffff810b3047>] mark_held_locks+0x67/0x90 [ 1749.868741] [<ffffffff810b3111>] lockdep_trace_alloc+0xa1/0xe0 [ 1749.868741] [<ffffffff81147419>] kmem_cache_alloc+0x39/0x1e0 [ 1749.868741] [<ffffffff8133f954>] kmem_zone_alloc+0x94/0xe0 [ 1749.868741] [<ffffffff8133f9be>] kmem_zone_zalloc+0x1e/0x50 [ 1749.868741] [<ffffffff81335f02>] xfs_trans_add_item+0x72/0xb0 [ 1749.868741] [<ffffffff81339e41>] xfs_trans_ijoin+0xa1/0xd0 [ 1749.868741] [<ffffffff81319f82>] xfs_itruncate_finish+0x312/0x5d0 [ 1749.868741] [<ffffffff8133cb87>] xfs_free_eofblocks+0x227/0x280 [ 1749.868741] [<ffffffff8133cd18>] xfs_release+0x138/0x190 [ 1749.868741] [<ffffffff813464c5>] xfs_file_release+0x15/0x20 [ 1749.868741] [<ffffffff81150ebf>] fput+0x13f/0x260 [ 1749.868741] [<ffffffff8114d8c2>] filp_close+0x52/0x80 [ 1749.868741] [<ffffffff8114d9a9>] sys_close+0xb9/0x120 [ 1749.868741] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
|
#
898621d5 |
|
23-Jun-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: simplify inode to transaction joining Currently we need to either call IHOLD or xfs_trans_ihold on an inode when joining it to a transaction via xfs_trans_ijoin. This patch instead makes xfs_trans_ijoin usable on its own by doing an implicit xfs_trans_ihold, which also allows us to drop the third argument. For the case where we want to hold a reference on the inode a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks the inode for needing an xfs_iput. In addition to the cleaner interface to the caller this also simplifies the implementation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
9412e318 |
|
23-Jun-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: merge iop_unpin_remove into iop_unpin The unpin_remove item operation instances always share most of the implementation with the respective unpin implementation. So instead of keeping two different entry points add a remove flag to the unpin operation and share the code more easily. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
e98c414f |
|
23-Jun-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: simplify log item descriptor tracking Currently we track log item descriptors belonging to a transaction using a complex opencoded chunk allocator. This code has been there since day one and seems to work around the lack of an efficient slab allocator. This patch replaces it with dynamically allocated log item descriptors from a dedicated slab pool, linked to the transaction by a linked list. This allows us to greatly simplify the log item descriptor tracking to the point where it's just a couple hundred lines in xfs_trans.c instead of a separate file. The external API has also been simplified while we're at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/delete items from a transaction have been simplified to the bare minimum, and the xfs_trans_find_item function is replaced with a direct dereference of the li_desc field. All debug code walking the list of log items in a transaction is down to a simple list_for_each_entry. Note that we could easily use a singly linked list here instead of the double linked list from list.h as the fastpath only does deletion from sequential traversal. But given that we don't have one available as a library function yet I use the list.h functions for simplicity. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
3400777f |
|
23-Jun-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: remove unneeded #include statements Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>
|
#
288699fe |
|
23-Jun-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: drop dmapi hooks Dmapi support was never merged upstream, but we still have a lot of hooks bloating XFS for it, all over the fast paths of the filesystem. This patch drops over 700 lines of dmapi overhead. If we ever get HSM support in mainline at least the namespace events can be done much saner in the VFS instead of the individual filesystem, so it's not like this is much help for future work. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
025101dc |
|
04-May-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: cleanup log reservation calculations Instead of having small helper functions calling big macros do the calculations for the log reservations directly in the functions. These are mostly 1:1 from the macros except that the macros kept the quota calculations in their callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
71e330b5 |
|
20-May-2010 |
Dave Chinner <dchinner@redhat.com> |
xfs: Introduce delayed logging core code The delayed logging code only changes in-memory structures and as such can be enabled and disabled with a mount option. Add the mount option and emit a warning that this is an experimental feature that should not be used in production yet. We also need infrastructure to track committed items that have not yet been written to the log. This is what the Committed Item List (CIL) is for. The log item also needs to be extended to track the current log vector, the associated memory buffer and its location in the Committed Item List. Extend the log item and log vector structures to enable this tracking. To maintain the current log format for transactions with delayed logging, we need to introduce a checkpoint transaction and a context for tracking each checkpoint from initiation to transaction completion. This includes adding a log ticket for tracking log space required/used by the context checkpoint. To track all the changes we need an io vector array per log item, rather than a single array for the entire transaction. Using the new log vector structure for this requires two passes - the first to allocate the log vector structures and chain them together, and the second to fill them out. This log vector chain can then be passed to the CIL for formatting, pinning and insertion into the CIL. Formatting of the log vector chain is relatively simple - it's just a loop over the iovecs on each log vector, but it is made slightly more complex because we re-write the iovec after the copy to point back at the memory buffer we just copied into. This code also needs to pin log items. If the log item is not already tracked in this checkpoint context, then it needs to be pinned. Otherwise it is already pinned and we don't need to pin it again. The only other complexity is calculating the amount of new log space the formatting has consumed.
This needs to be accounted to the transaction in progress, and the accounting is made more complex because we also need to steal space from it for log metadata in the checkpoint transaction. Calculate all this at insert time and update all the tickets, counters, etc correctly. Once we've formatted all the log items in the transaction, attach the busy extents to the checkpoint context so the busy extents live until checkpoint completion and can be processed at that point in time. Transactions can then be freed at this point in time. Now we need to issue checkpoints - we are tracking the amount of log space used by the items in the CIL, so we can trigger background checkpoints when the space usage gets to a certain threshold. Otherwise, checkpoints need to be triggered when a log synchronisation point is reached - a log force event. Because the log write code already handles chained log vectors, writing the transaction is trivial, too. Construct a transaction header, add it to the head of the chain and write it into the log, then issue a commit record write. Then we can release the checkpoint log ticket and attach the context to the log buffer so it can be called during I/O completion to complete the checkpoint. We also need to allow for synchronising multiple in-flight checkpoints. This is needed for two things - the first is to ensure that checkpoint commit records appear in the log in the correct sequence order (so they are replayed in the correct order). The second is so that xfs_log_force_lsn() operates correctly and only flushes and/or waits for the specific sequence it was provided with. To do this we need a wait variable and a list tracking the checkpoint commits in progress. We can walk this list and wait for the checkpoints to change state or complete easily, and this provides the necessary synchronisation for correct operation in both cases.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
ed3b4d6c |
|
20-May-2010 |
Dave Chinner <david@fromorbit.com> |
xfs: Improve scalability of busy extent tracking When we free a metadata extent, we record it in the per-AG busy extent array so that it is not re-used before the freeing transaction hits the disk. This array is fixed size, so when it overflows we make further allocation transactions synchronous because we cannot track more freed extents until those transactions hit the disk and are completed. Under heavy mixed allocation and freeing workloads with large log buffers, we can overflow this array quite easily. Further, the array is sparsely populated, which means that inserts need to search for a free slot, and array searches often have to search many more slots than are actually used to check all the busy extents. Quite inefficient, really. To enable this aspect of extent freeing to scale better, we need a structure that can grow dynamically. While in other areas of XFS we have used radix trees, the extents being freed are at random locations on disk so are better suited to being indexed by an rbtree. So, use a per-AG rbtree indexed by block number to track busy extents. This incurs a memory allocation when marking an extent busy, but should not occur too often in low memory situations. This should scale to an arbitrary number of extents so should not be a limitation for features such as in-memory aggregation of transactions. However, there are still situations where we can't avoid allocating busy extents (such as allocation from the AGFL). To minimise the overhead of such occurrences, we need to avoid doing a synchronous log force while holding the AGF locked to ensure that the previous transactions are safely on disk before we use the extent. We can do this by marking the transaction doing the allocation as synchronous rather than issuing a log force.
Because of the locking involved and the ordering of transactions, the synchronous transaction provides the same guarantees as a synchronous log force because it ensures that all the prior transactions are already on disk when the synchronous transaction hits the disk. i.e. it preserves the free->allocate order of the extent correctly in recovery. By doing this, we avoid holding the AGF locked while log writes are in progress, hence reducing the length of time the lock is held and therefore we increase the rate at which we can allocate and free from the allocation group, thereby increasing overall throughput. The only problem with this approach is that when a metadata buffer is marked stale (e.g. a directory block is removed), then the buffer remains pinned and locked until the log goes to disk. The issue here is that if that stale buffer is reallocated in a subsequent transaction, the attempt to lock that buffer in the transaction will hang waiting for the log to go to disk to unlock and unpin the buffer. Hence if someone tries to lock a pinned, stale, locked buffer we need to push on the log to get it unlocked ASAP. Effectively we are trading off a guaranteed log force for a much less common trigger for log force to occur. Ideally we should not reallocate busy extents. That is a much more complex fix to the problem as it involves direct intervention in the allocation btree searches in many places. This is left to a future set of modifications. Finally, now that we track busy extents in allocated memory, we don't need the descriptors in the transaction structure to point to them. We can replace the complex busy chunk infrastructure with a simple linked list of busy extents. This allows us to remove a large chunk of code, making the overall change a net reduction in code size. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
b1c1b5b6 |
|
22-Mar-2010 |
Dave Chinner <dchinner@redhat.com> |
xfs: Clean up xfs_trans_committed code after factoring Now that the code has been factored, clean up all the remaining style cruft, simplify the code and re-order functions so that it doesn't need forward declarations. Also move the remaining functions that require forward declarations (xfs_trans_uncommit, xfs_trans_free) so that all the forward declarations can be removed from the file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
8e646a55 |
|
07-Mar-2010 |
Dave Chinner <dchinner@redhat.com> |
xfs: update and factor xfs_trans_committed() The function header to xfs_trans_committed has long had this comment: * THIS SHOULD BE REWRITTEN TO USE xfs_trans_next_item() To prepare for different methods of committing items, convert the code to use xfs_trans_next_item() and factor the code into smaller, more digestible chunks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
a3ccd2ca |
|
14-Mar-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: clean up xfs_trans_commit logic even more > +shut_us_down: > + shutdown = XFS_FORCED_SHUTDOWN(mp) ? EIO : 0; > + if (!(tp->t_flags & XFS_TRANS_DIRTY) || shutdown) { > + xfs_trans_unreserve_and_mod_sb(tp); > + /* This whole area in _xfs_trans_commit is still a complete mess. So while touching this code, unravel this mess as well to make the whole flow of the function simpler and clearer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com>
|
#
0924378a |
|
07-Mar-2010 |
Dave Chinner <dchinner@redhat.com> |
xfs: split out iclog writing from xfs_trans_commit() Split the part of xfs_trans_commit() that deals with writing the transaction into the iclog into a separate function. This isolates the physical commit process from the logical commit operation and makes it easier to insert different transaction commit paths without affecting the existing algorithm adversely. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
8e123850 |
|
07-Mar-2010 |
Dave Chinner <dchinner@redhat.com> |
xfs: remove stale parameter from ->iop_unpin method The staleness of an object being unpinned can be directly derived from the object itself - there is no need to extract it from the object then pass it as a parameter into IOP_UNPIN(). This means we can kill the XFS_LID_BUF_STALE flag - it is set, checked and cleared in the same places as the XFS_BLI_STALE flag in the xfs_buf_log_item so it is now redundant and hence safe to remove. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
|
#
35a8a72f |
|
15-Feb-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: stop passing opaque handles to xfs_log.c routines Currently we pass opaque xfs_log_ticket_t handles instead of struct xlog_ticket pointers, and void pointers instead of struct xlog_in_core pointers to various log manager functions. Instead pass properly typed pointers after adding forward declarations for them to xfs_log.h, and adjust the touched function prototypes to the standard XFS style while at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
a14a348b |
|
19-Jan-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: cleanup up xfs_log_force calling conventions Remove the XFS_LOG_FORCE argument which was always set, and the XFS_LOG_URGE define, which was never used. Split xfs_log_force into two helpers - xfs_log_force which forces the whole log, and xfs_log_force_lsn which forces up to the specified LSN. The underlying implementations already were entirely separate, as were the users. Also re-indent the new _xfs_log_force/_xfs_log_force which previously had a weird coding style. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
4139b3b3 |
|
19-Jan-2010 |
Christoph Hellwig <hch@infradead.org> |
xfs: kill XLOG_VEC_SET_TYPE This macro only obfuscates the log item type assignments, so kill it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
80641dc6 |
|
18-Oct-2009 |
Christoph Hellwig <hch@infradead.org> |
xfs: I/O completion handlers must use NOFS allocations When completing I/O requests we must not allow the memory allocator to recurse into the filesystem, as we might deadlock on waiting for the I/O completion otherwise. The only thing currently allocating normal GFP_KERNEL memory is the allocation of the transaction structure for the unwritten extent conversion. Add a memflags argument to _xfs_trans_alloc to allow controlling the allocator behaviour. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Thomas Neumann <tneumann@users.sourceforge.net> Tested-by: Thomas Neumann <tneumann@users.sourceforge.net> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>
|
#
f9502216 |
|
04-Jun-2009 |
Christoph Hellwig <hch@lst.de> |
xfs: remove ->write_super and stop maintaining ->s_dirt The write_super method is (1) used for writing back the superblock periodically from pdflush and (2) called just before ->sync_fs for data integrity syncs. We don't need (1) because we have our own periodic writeout through xfssyncd, and we don't need (2) because xfs_fs_sync_fs performs a proper synchronous superblock writeout after all other data and metadata has been written out. Also remove ->s_dirt tracking as it's only used to decide when to call ->write_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
7d095257 |
|
08-Jun-2009 |
Christoph Hellwig <hch@lst.de> |
xfs: kill xfs_qmops Kill the quota ops function vector and replace it with direct calls or stubs in the CONFIG_XFS_QUOTA=n case. Make sure we check XFS_IS_QUOTA_RUNNING in the right spots. We can remove a number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set otherwise. This brings us back closer to the way this code worked in IRIX and earlier Linux versions, but we keep a lot of the more useful factoring of common code. Eventually we should also kill xfs_qm_bhv.c, but that's left for a later patch. Reduces the size of the source code by about 250 lines and the size of XFS module by about 1.5 kilobytes with quotas enabled: text data bss dec hex filename 615957 2960 3848 622765 980ad fs/xfs/xfs.o 617231 3152 3848 624231 98667 fs/xfs/xfs.o.old Fallout: - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects the inode locked and xfs_qm_dqattach which does the locking around it, thus removing XFS_QMOPT_ILOCKED. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
|
#
cc09c0dc |
|
16-Nov-2008 |
Dave Chinner <david@fromorbit.com> |
[XFS] Fix double free of log tickets When an I/O error occurs during an intermediate commit on a rolling transaction, xfs_trans_commit() will free the transaction structure and the related ticket. However, the duplicate transaction that gets used as the transaction continues still contains a pointer to the ticket. Hence when the duplicate transaction is cancelled and freed, we free the ticket a second time. Add reference counting to the ticket so that we hold an extra reference to the ticket over the transaction commit. We drop the extra reference once we have checked that the transaction commit did not return an error, thus avoiding a double free on commit error. Credit to Nick Piggin for tripping over the problem. SGI-PV: 989741 Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
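The reference-counting idea described above can be sketched as follows. This is a simplified, single-threaded illustration: struct ticket, ticket_alloc(), ticket_get(), ticket_put() and the tickets_freed counter are hypothetical names, not the kernel's log ticket API, and a real implementation would need atomic counting when several contexts can drop references concurrently.

```c
#include <stdlib.h>

struct ticket {
	int refcount;
};

/* Instrumentation for the sketch: counts how often a ticket is freed. */
static int tickets_freed;

/* A new ticket starts with one reference, owned by the transaction. */
struct ticket *ticket_alloc(void)
{
	struct ticket *t = malloc(sizeof(*t));

	if (t != NULL)
		t->refcount = 1;
	return t;
}

/* Take an extra reference, e.g. before committing a rolling transaction
 * whose duplicate still points at the same ticket. */
struct ticket *ticket_get(struct ticket *t)
{
	t->refcount++;
	return t;
}

/* Drop a reference; only the last put actually frees the ticket. */
void ticket_put(struct ticket *t)
{
	if (--t->refcount == 0) {
		free(t);
		tickets_freed++;
	}
}
```

In the failure scenario from the message, the failing commit drops one reference and the later cancel of the duplicate transaction drops the other, so the memory is released exactly once instead of twice.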
|
#
783a2f65 |
|
30-Oct-2008 |
David Chinner <david@fromorbit.com> |
[XFS] Finish removing the mount pointer from the AIL API Change all the remaining AIL API functions that are passed struct xfs_mount pointers to pass pointers directly to the struct xfs_ail being used. With this conversion, all external access to the AIL is via the struct xfs_ail. Hence the operation and referencing of the AIL is almost entirely independent of the xfs_mount that is using it - it is now much more tightly tied to the log and the items it is tracking in the log than it is tied to the xfs_mount. SGI-PV: 988143 SGI-Modid: xfs-linux-melb:xfs-kern:32353a Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
|
#
fc1829f3 |
|
30-Oct-2008 |
David Chinner <david@fromorbit.com> |
[XFS] Add ail pointer into log items Add an xfs_ail pointer to log items so that the log items can reference the AIL directly during callbacks without needing a struct xfs_mount. SGI-PV: 988143 SGI-Modid: xfs-linux-melb:xfs-kern:32352a Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
|
#
c7e8f268 |
|
30-Oct-2008 |
David Chinner <david@fromorbit.com> |
[XFS] Move the AIL lock into the struct xfs_ail Bring the ail lock inside the struct xfs_ail. This means the AIL can be entirely manipulated via the struct xfs_ail rather than needing both the struct xfs_mount and the struct xfs_ail. SGI-PV: 988143 SGI-Modid: xfs-linux-melb:xfs-kern:32350a Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
|
#
39dab9d7 |
|
13-Aug-2008 |
Eric Sandeen <sandeen@sandeen.net> |
[XFS] remove shouting-indirection macros from xfs_trans.h SGI-PV: 981498 SGI-Modid: xfs-linux-melb:xfs-kern:31758a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
#
322ff6b8 |
|
13-Aug-2008 |
Niv Sardi <xaiki@sgi.com> |
[XFS] Move xfs_attr_rolltrans to xfs_trans_roll Move it from the attr code to the transaction code and make the attr code call the new function. Rolltrans is really useful whenever we want to use a rolling transaction; it should be generic, and it isn't dependent on any part of the attr code anyway. We use this excuse to change all the: if ((error = xfs_attr_rolltrans())) calls into: error = xfs_trans_roll(); if (error) SGI-PV: 981498 SGI-Modid: xfs-linux-melb:xfs-kern:31729a Signed-off-by: Niv Sardi <xaiki@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
#
f0e2d93c |
|
19-May-2008 |
Denys Vlasenko <vda.linux@googlemail.com> |
[XFS] Remove unused arg from kmem_free() kmem_free() function takes (ptr, size) arguments but doesn't actually use the second one. This patch removes the size argument from all callsites. SGI-PV: 981498 SGI-Modid: xfs-linux-melb:xfs-kern:31050a Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
|
#
413d57c9 |
|
13-Feb-2008 |
Marcin Slusarz <marcin.slusarz@gmail.com> |
xfs: convert beX_add to beX_add_cpu (new common API) remove beX_add functions and replace all uses with beX_add_cpu Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Reviewed-by: Dave Chinner <dgc@sgi.com> Cc: Timothy Shimmin <tes@sgi.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
287f3dad |
|
11-Oct-2007 |
Donald Douwsma <donaldd@sgi.com> |
[XFS] Unwrap AIL_LOCK SGI-PV: 970382 SGI-Modid: xfs-linux-melb:xfs-kern:29739a Signed-off-by: Donald Douwsma <donaldd@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Tim Shimmin <tes@sgi.com>
|
#
b267ce99 |
|
30-Aug-2007 |
Christoph Hellwig <hch@infradead.org> |
[XFS] kill struct bhv_vfs Now that struct bhv_vfs doesn't have any members left we can kill it and go directly from the super_block to the xfs_mount everywhere. SGI-PV: 969608 SGI-Modid: xfs-linux-melb:xfs-kern:29509a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
|
#
2bdf7cd0 |
|
27-Aug-2007 |
Christoph Hellwig <hch@infradead.org> |
[XFS] superblock endianness annotations Creates a new xfs_dsb_t that is __be annotated and keeps xfs_sb_t for the incore one. xfs_xlatesb is renamed to xfs_sb_to_disk and only handles the incore -> disk conversion. A new helper xfs_sb_from_disk handles the other direction and doesn't need the slightly hacky table-driven approach because we only ever read the full sb from disk. The handling of shared r/o filesystems has been buggy on little endian systems and fixing this required shuffling around of some code in that area. SGI-PV: 968563 SGI-Modid: xfs-linux-melb:xfs-kern:29477a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
|
#
45c34141 |
|
18-Jun-2007 |
David Chinner <dgc@sgi.com> |
[XFS] Apply transaction delta counts atomically to incore counters With the per-cpu superblock counters, batch updates are no longer atomic across the entire batch of changes. This is not an issue if each individual change in the batch is applied atomically. Unfortunately, free block count changes are not applied atomically, and they are applied in a manner guaranteed to cause problems. Essentially, the free block count reservation that the transaction took initially is returned to the in core counters before a second delta takes away what is used. Because these two operations are not atomic, we can race with another thread that can use the returned transaction reservation before the transaction takes the space away again and we can then get ENOSPC being reported in a spot where we don't have an ENOSPC condition, nor should we ever see one there. Fix it up by rolling the two deltas into the one so it can be applied safely (i.e. atomically) to the incore counters. SGI-PV: 964465 SGI-Modid: xfs-linux-melb:xfs-kern:28796a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
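The idea of rolling the two deltas into one can be sketched with C11 atomics. This is an illustration under stated assumptions, not the kernel implementation: `free_blocks`, `mod_free_blocks()` and `apply_trans_deltas()` are hypothetical names, and a single atomic counter stands in for the per-cpu superblock counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical in-core free block counter. */
static _Atomic int64_t free_blocks;

/*
 * Apply one signed delta atomically. If the result would go negative,
 * leave the counter untouched and report "no space".
 */
int mod_free_blocks(int64_t delta)
{
    int64_t old = atomic_load(&free_blocks);

    for (;;) {
        int64_t next = old + delta;

        if (next < 0)
            return -28;         /* stands in for -ENOSPC */
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&free_blocks, &old, next))
            return 0;
    }
}

/*
 * At commit time, return only the unused part of the reservation.
 * Applying "+reserved" and "-used" as two separate updates would open
 * a window in which another thread could consume the returned blocks
 * and make the second update fail.
 */
int apply_trans_deltas(int64_t reserved, int64_t used)
{
    assert(used <= reserved);
    return mod_free_blocks(reserved - used);
}
```

Because the combined delta is never negative when `used <= reserved`, the commit-time update in this sketch cannot report a spurious out-of-space error.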
|
#
210c6f1c |
|
23-May-2007 |
David Chinner <dgc@sgi.com> |
[XFS] Fix the transaction flags to make lazy superblock counters work. SGI-PV: 964999 SGI-Modid: xfs-linux-melb:xfs-kern:28653a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
|
#
92821e2b |
|
23-May-2007 |
David Chinner <dgc@sgi.com> |
[XFS] Lazy Superblock Counters When we have a couple of hundred transactions on the fly at once, they all typically modify the on disk superblock in some way. create/unlink/mkdir/rmdir modify inode counts, allocation/freeing modify free block counts. When these counts are modified in a transaction, they must eventually lock the superblock buffer and apply the mods. The buffer then remains locked until the transaction is committed into the incore log buffer. The result of this is that with enough transactions on the fly the incore superblock buffer becomes a bottleneck. The result of contention on the incore superblock buffer is that transaction rates fall - the more pressure that is put on the superblock buffer, the slower things go. The key to removing the contention is to not require the superblock fields in question to be locked. We do that by not marking the superblock dirty in the transaction. IOWs, we modify the incore superblock but do not modify the cached superblock buffer. In short, we do not log superblock modifications to critical fields in the superblock on every transaction. In fact we only do it just before we write the superblock to disk every sync period or just before unmount. This creates an interesting problem - if we don't log or write out the fields in every transaction, then how do the values get recovered after a crash? The answer is simple - we keep enough duplicate, logged information in other structures that we can reconstruct the correct count after log recovery has been performed. It is the AGF and AGI structures that contain the duplicate information; after recovery, we walk every AGI and AGF and sum their individual counters to get the correct value, and we do a transaction into the log to correct them.
An optimisation of this is that if we have a clean unmount record, we know the value in the superblock is correct, so we can avoid the summation walk under normal conditions and so mount/recovery times do not change under normal operation. One wrinkle that was discovered during development was that the blocks used in the freespace btrees are never accounted for in the AGF counters. This was once a valid optimisation to make; when the filesystem is full, the free space btrees are empty and consume no space. Hence when it matters, the "accounting" is correct. But that means that when we do the AGF summations, we would not have a correct count and xfs_check would complain. Hence a new counter was added to track the number of blocks used by the free space btrees. This is an *on-disk format change*. As a result of this, lazy superblock counters are a mkfs option and at the moment on linux there is no way to convert an old filesystem. This is possible - xfs_db can be used to twiddle the right bits and then xfs_repair will do the format conversion for you. Similarly, you can convert backwards as well. At some point we'll add functionality to xfs_admin to do the bit twiddling easily.... SGI-PV: 964999 SGI-Modid: xfs-linux-melb:xfs-kern:28652a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
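The post-recovery summation walk described above can be sketched as follows. The structure and field names are hypothetical stand-ins for the per-AG AGI/AGF counters, and the way the block counters are combined (free blocks plus free-list blocks plus free space btree blocks) is an assumption of this sketch rather than something the commit message spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-allocation-group counters (AGI and AGF fields). */
struct ag_counters {
    uint64_t agi_count;         /* allocated inodes in this AG */
    uint64_t agi_freecount;     /* free inodes in this AG */
    uint64_t agf_freeblks;      /* free blocks in this AG */
    uint64_t agf_flcount;       /* blocks on the AG free list (assumed) */
    uint64_t agf_btreeblks;     /* blocks used by the free space btrees */
};

/* Hypothetical superblock summary counters. */
struct sb_counters {
    uint64_t icount;            /* allocated inodes, filesystem-wide */
    uint64_t ifree;             /* free inodes, filesystem-wide */
    uint64_t fdblocks;          /* free data blocks, filesystem-wide */
};

/*
 * Rebuild the superblock counters by walking every AG, as would be
 * done after log recovery when there is no clean unmount record.
 */
struct sb_counters sum_ag_counters(const struct ag_counters *ags, size_t nr)
{
    struct sb_counters sb = { 0, 0, 0 };
    size_t i;

    assert(ags != NULL || nr == 0);
    for (i = 0; i < nr; i++) {
        sb.icount += ags[i].agi_count;
        sb.ifree += ags[i].agi_freecount;
        /* Without the btree block count the total would come up short. */
        sb.fdblocks += ags[i].agf_freeblks +
                       ags[i].agf_flcount +
                       ags[i].agf_btreeblks;
    }
    return sb;
}
```

With a clean unmount record the walk is skipped entirely, which is why the commit notes that mount times do not change in the normal case.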
|
#
1c72bf90 |
|
07-May-2007 |
Eric Sandeen <sandeen@sandeen.net> |
[XFS] The last argument "lsn" of xfs_trans_commit() is always called with NULL. Patch provided by Eric Sandeen. SGI-PV: 961693 SGI-Modid: xfs-linux-melb:xfs-kern:28199a Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Tim Shimmin <tes@sgi.com>
|
#
20f4ebf2 |
|
10-Feb-2007 |
David Chinner <dgc@sgi.com> |
[XFS] Make growfs work for amounts greater than 2TB The free block modification code has a 32bit interface, limiting the size the filesystem can be grown even on 64 bit machines. On 32 bit machines, there are other 32bit variables in transaction structures and interfaces that need to be expanded to allow this to work. SGI-PV: 959978 SGI-Modid: xfs-linux-melb:xfs-kern:27894a Signed-off-by: David Chinner <dgc@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
|
#
f6c2d1fa |
|
19-Jun-2006 |
Nathan Scott <nathans@sgi.com> |
[XFS] Remove version 1 directory code. Never functioned on Linux, just pure bloat. SGI-PV: 952969 SGI-Modid: xfs-linux-melb:xfs-kern:26251a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
34327e13 |
|
09-Jun-2006 |
Nathan Scott <nathans@sgi.com> |
[XFS] Cleanup a missed porting conversion, and freezing. SGI-PV: 953338 SGI-Modid: xfs-linux-melb:xfs-kern:26109a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
59c1b082 |
|
08-Jun-2006 |
Nathan Scott <nathans@sgi.com> |
[XFS] Make the pflags test/set wrappers more legible for us mere humans. SGI-PV: 953338 SGI-Modid: xfs-linux-melb:xfs-kern:26099a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
7d04a335 |
|
08-Jun-2006 |
Nathan Scott <nathans@sgi.com> |
[XFS] Shutdown the filesystem if all device paths have gone. Made shutdown vop flags consistent with sync vop flags declarations too. SGI-PV: 939911 SGI-Modid: xfs-linux-melb:xfs-kern:26096a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
c41564b5 |
|
28-Mar-2006 |
Nathan Scott <nathans@sgi.com> |
[XFS] We really suck at spulling. Thanks to Chris Pascoe for fixing all these typos. SGI-PV: 904196 SGI-Modid: xfs-linux-melb:xfs-kern:25539a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
2ddd5928 |
|
16-Mar-2006 |
Nathan Scott <nathans@sgi.com> |
[XFS] Correct the dquot reservation component for the link transaction. SGI-PV: 904196 SGI-Modid: xfs-linux-melb:xfs-kern:25476a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
8f794055 |
|
13-Mar-2006 |
Nathan Scott <nathans@sgi.com> |
[XFS] Reduce complexity in xfs_trans_init by pushing complex macros out into functions and hence reduce the stack footprint there. SGI-PV: 947312 SGI-Modid: xfs-linux-melb:xfs-kern:25360a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
60a204f0 |
|
10-Jan-2006 |
Nathan Scott <nathans@sgi.com> |
[XFS] Fix a thinko when generating a forced shutdown stack trace. SGI-PV: 929558 SGI-Modid: xfs-linux-melb:xfs-kern:203817a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
0733af21 |
|
10-Jan-2006 |
Ryan Hankins <hankins@sgi.com> |
[XFS] Add a stack trace in the case of xfs_forced_shutdown. SGI-PV: 929558 SGI-Modid: xfs-linux-melb:xfs-kern:203701a Signed-off-by: Ryan Hankins <hankins@sgi.com> Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
cfcbbbd0 |
|
01-Nov-2005 |
Nathan Scott <nathans@sgi.com> |
[XFS] Remove old, broken nolog-mode code - no one plans to ever fix it. SGI-PV: 944821 SGI-Modid: xfs-linux:xfs-kern:24213a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
7b718769 |
|
01-Nov-2005 |
Nathan Scott <nathans@sgi.com> |
[XFS] Update license/copyright notices to match the preferred SGI boilerplate. SGI-PV: 913862 SGI-Modid: xfs-linux:xfs-kern:23903a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
a844f451 |
|
01-Nov-2005 |
Nathan Scott <nathans@sgi.com> |
[XFS] Remove xfs_macros.c, xfs_macros.h, rework headers a whole lot. SGI-PV: 943122 SGI-Modid: xfs-linux:xfs-kern:23901a Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
f538d4da |
|
01-Nov-2005 |
Christoph Hellwig <hch@sgi.com> |
[XFS] write barrier support Issue all log sync operations as ordered writes. In addition flush the disk cache on fsync if the sync cached operation didn't sync the log to disk (this requires some additional bookkeeping in the transaction and log code). If the device doesn't claim to support barriers, the filesystem has an external log volume or the trial superblock write with barriers enabled failed we disable barriers and print a warning. We should probably fail the mount completely, but that could lead to nasty boot failures for the root filesystem. Not enabled by default yet, needs more destructive testing first. SGI-PV: 912426 SGI-Modid: xfs-linux:xfs-kern:198723a Signed-off-by: Christoph Hellwig <hch@sgi.com> Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
7e9c6396 |
|
02-Sep-2005 |
Tim Shimmin <tes@sgi.com> |
[XFS] 929956 add log debugging and tracing info SGI-PV: 931456 SGI-Modid: xfs-linux:xfs-kern:23155a Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
ba0f32d4 |
|
20-Jun-2005 |
Christoph Hellwig <hch@sgi.com> |
[XFS] mark various symbols static Patch from Adrian Bunk SGI-PV: 936255 SGI-Modid: xfs-linux:xfs-kern:192760a Signed-off-by: Christoph Hellwig <hch@sgi.com> Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
4372d6e1 |
|
20-Jun-2005 |
Christoph Hellwig <hch@sgi.com> |
[XFS] Remove dead code. Patch from Adrian Bunk SGI-PV: 936255 SGI-Modid: xfs-linux:xfs-kern:192759a Signed-off-by: Christoph Hellwig <hch@sgi.com> Signed-off-by: Nathan Scott <nathans@sgi.com>
|
#
1da177e4 |
|
16-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|