History log of /linux-master/fs/xfs/xfs_mount.h
Revision Date Author Comments
# b74e15d7 16-Sep-2021 Darrick J. Wong <djwong@kernel.org>

xfs: compute maximum AG btree height for critical reservation calculation

Compute the actual maximum AG btree height for deciding if a per-AG
block reservation is critically low. This only affects the sanity check
condition, since we /generally/ will trigger on the 10% threshold. This
is a long-winded way of saying that we're removing one more usage of
XFS_BTREE_MAXLEVELS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 7cb3efb4 13-Oct-2021 Darrick J. Wong <djwong@kernel.org>

xfs: rename m_ag_maxlevels to m_allocbt_maxlevels

Years ago when XFS was thought to be much more simple, we introduced
m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
a given filesystem mount. Then we observed that inode btrees don't
actually have the same height and split that off; and now we have rmap
and refcount btrees with much different geometries and separate
maxlevels variables.

The 'ag' part of the name doesn't make much sense anymore, so rename
this to m_alloc_maxlevels to reinforce that this is the maximum height
of the *free space* btrees. This sets us up for the next patch, which
will add a variable to track the maximum height of all AG btrees.

(Also take the opportunity to improve adjacent comments and fix minor
style problems.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 75c8c50f 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdown

Remove the shouty macro and instead use the inline function that
matches other state/feature check wrapper naming. This conversion
was done with sed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 2e973b2c 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert remaining mount flags to state flags

The remaining mount flags kept in m_flags are actually runtime state
flags. These change dynamically, so they really should be updated
atomically so we don't potentially lose an update due to racing
modifications.

Convert these remaining flags to be stored in m_opstate and use
atomic bitops to set and clear the flags. This also adds a couple of
simple wrappers for common state checks - read only and shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 0560f31a 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert mount flags to features

Replace m_flags feature checks with xfs_has_<feature>() calls and
rework the setup code to set flags in m_features.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 8970a5b8 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: consolidate mount option features in m_features

This provides separation of mount time feature flags from runtime
mount flags and mount option state. It also makes the feature
checks use the same interface as the superblock features. i.e. we
don't care if the feature is enabled by superblock flags or mount
options, we just care if it's enabled or not.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 38c26bfd 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: replace xfs_sb_version checks with feature flag checks

Convert the xfs_sb_version_hasfoo() to checks against
mp->m_features. Checks of the superblock itself during disk
operations (e.g. in the read/write verifiers and the to/from disk
formatters) are not converted - they operate purely on the
superblock state. Everything else should use the mount features.

Large parts of this conversion were done with sed with commands like
this:

for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
done

With manual cleanups for things like "xfs_has_extflgbit" and other
little inconsistencies in naming.

The result is ia lot less typing to check features and an XFS binary
size reduced by a bit over 3kB:

$ size -t fs/xfs/built-in.a
text data bss dec hex filenam
before 1130866 311352 484 1442702 16038e (TOTALS)
after 1127727 311352 484 1439563 15f74b (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# a1d86e8d 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: reflect sb features in xfs_mount

Currently on-disk feature checks require decoding the superblock
fileds and so can be non-trivial. We have almost 400 hundred
individual feature checks in the XFS code, so this is a significant
amount of code. To reduce runtime check overhead, pre-process all
the version flags into a features field in the xfs_mount at mount
time so we can convert all the feature checks to a simple flag
check.

There is also a need to convert the dynamic feature flags to update
the m_features field. This is required for attr, attr2 and quota
features. New xfs_mount based wrappers are added for this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 7f89c838 10-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: add trace point for fs shutdown

Add a tracepoint for fs shutdowns so we can capture that in ftrace
output.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 33c0dd78 10-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: move the CIL workqueue to the CIL

We only use the CIL workqueue in the CIL, so it makes no sense to
hang it off the xfs_mount and have to walk multiple pointers back up
to the mount when we have the CIL structures right there.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 908ce71e 08-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: allow setting and clearing of log incompat feature flags

Log incompat feature flags in the superblock exist for one purpose: to
protect the contents of a dirty log from replay on a kernel that isn't
prepared to handle those dirty contents. This means that they can be
cleared if (a) we know the log is clean and (b) we know that there
aren't any other threads in the system that might be setting or relying
upon a log incompat flag.

Therefore, clear the log incompat flags when we've finished recovering
the log, when we're unmounting cleanly, remounting read-only, or
freezing; and provide a function so that subsequent patches can start
using this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>


# 40b1de007 06-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: throttle inode inactivation queuing on memory reclaim

Now that we defer inode inactivation, we've decoupled the process of
unlinking or closing an inode from the process of inactivating it. In
theory this should lead to better throughput since we now inactivate the
queued inodes in batches instead of one at a time.

Unfortunately, one of the primary risks with this decoupling is the loss
of rate control feedback between the frontend and background threads.
In other words, a rm -rf /* thread can run the system out of memory if
it can queue inodes for inactivation and jump to a new CPU faster than
the background threads can actually clear the deferred work. The
workers can get scheduled off the CPU if they have to do IO, etc.

To solve this problem, we configure a shrinker so that it will activate
the /second/ time the shrinkers are called. The custom shrinker will
queue all percpu deferred inactivation workers immediately and set a
flag to force frontend callers who are releasing a vfs inode to wait for
the inactivation workers.

On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
most of the OOMing problem when deleting 10 million inodes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 6f649091 06-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: don't run speculative preallocation gc when fs is frozen

Now that we have the infrastructure to switch background workers on and
off at will, fix the block gc worker code so that we don't actually run
the worker when the filesystem is frozen, same as we do for deferred
inactivation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 65f03d86 06-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: queue inactivation immediately when free realtime extents are tight

Now that we have made the inactivation of unlinked inodes a background
task to increase the throughput of file deletions, we need to be a
little more careful about how long of a delay we can tolerate.

Similar to the patch doing this for free space on the data device, if
the file being inactivated is a realtime file and the realtime volume is
running low on free extents, we want to run the worker ASAP so that the
realtime allocator can make better decisions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 7d6f07d2 06-Aug-2021 Darrick J. Wong <djwong@kernel.org>

xfs: queue inactivation immediately when free space is tight

Now that we have made the inactivation of unlinked inodes a background
task to increase the throughput of file deletions, we need to be a
little more careful about how long of a delay we can tolerate.

On a mostly empty filesystem, the risk of the allocator making poor
decisions due to fragmentation of the free space on account a lengthy
delay in background updates is minimal because there's plenty of space.
However, if free space is tight, we want to deallocate unlinked inodes
as quickly as possible to avoid fallocate ENOSPC and to give the
allocator the best shot at optimal allocations for new writes.

Therefore, queue the percpu worker immediately if the filesystem is more
than 95% full. This follows the same principle that XFS becomes less
aggressive about speculative allocations and lazy cleanup (and more
precise about accounting) when nearing full.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# ab23a776 06-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: per-cpu deferred inode inactivation queues

Move inode inactivation to background work contexts so that it no
longer runs in the context that releases the final reference to an
inode. This will allow process work that ends up blocking on
inactivation to continue doing work while the filesytem processes
the inactivation in the background.

A typical demonstration of this is unlinking an inode with lots of
extents. The extents are removed during inactivation, so this blocks
the process that unlinked the inode from the directory structure. By
moving the inactivation to the background process, the userspace
applicaiton can keep working (e.g. unlinking the next inode in the
directory) while the inactivation work on the previous inode is
done by a different CPU.

The implementation of the queue is relatively simple. We use a
per-cpu lockless linked list (llist) to queue inodes for
inactivation without requiring serialisation mechanisms, and a work
item to allow the queue to be processed by a CPU bound worker
thread. We also keep a count of the queue depth so that we can
trigger work after a number of deferred inactivations have been
queued.

The use of a bound workqueue with a single work depth allows the
workqueue to run one work item per CPU. We queue the work item on
the CPU we are currently running on, and so this essentially gives
us affine per-cpu worker threads for the per-cpu queues. THis
maintains the effective CPU affinity that occurs within XFS at the
AG level due to all objects in a directory being local to an AG.
Hence inactivation work tends to run on the same CPU that last
accessed all the objects that inactivation accesses and this
maintains hot CPU caches for unlink workloads.

A depth of 32 inodes was chosen to match the number of inodes in an
inode cluster buffer. This hopefully allows sequential
allocation/unlink behaviours to defering inactivation of all the
inodes in a single cluster buffer at a time, further helping
maintain hot CPU and buffer cache accesses while running
inactivations.

A hard per-cpu queue throttle of 256 inode has been set to avoid
runaway queuing when inodes that take a long to time inactivate are
being processed. For example, when unlinking inodes with large
numbers of extents that can take a lot of processing to free.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: tweak comments and tracepoints, convert opflags to state bits]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 0ed17f01 06-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: introduce all-mounts list for cpu hotplug notifications

The inode inactivation and CIL tracking percpu structures are
per-xfs_mount structures. That means when we get a CPU dead
notification, we need to then iterate all the per-cpu structure
instances to process them. Rather than keeping linked lists of
per-cpu structures in each subsystem, add a list of all xfs_mounts
that the generic xfs_cpu_dead() function will iterate and call into
each subsystem appropriately.

This allows us to handle both per-mount and global XFS percpu state
from xfs_cpu_dead(), and avoids the need to link subsystem
structures that can be easily found from the xfs_mount into their
own global lists.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: expand some comments about mount list setup ordering rules]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 07b6403a 01-Jun-2021 Dave Chinner <dchinner@redhat.com>

xfs: move perag structure and setup to libxfs/xfs_ag.[ch]

Move the xfs_perag infrastructure to the libxfs files that contain
all the per AG infrastructure. This helps set up for passing perags
around all the code instead of bare agnos with minimal extra
includes for existing files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 61aa005a 01-Jun-2021 Dave Chinner <dchinner@redhat.com>

xfs: prepare for moving perag definitions and support to libxfs

The perag structures really need to be defined with the rest of the
AG support infrastructure. The struct xfs_perag and init/teardown
has been placed in xfs_mount.[ch] because there are differences in
the structure between kernel and userspace. Mainly that userspace
doesn't have a lot of the internal stuff that the kernel has for
caches and discard and other such structures.

However, it makes more sense to move this to libxfs than to keep
this separation because we are now moving to use struct perags
everywhere in the code instead of passing raw agnumber_t values
about. Hence we shoudl really move the support infrastructure to
libxfs/xfs_ag.[ch].

To do this without breaking userspace, first we need to rearrange
the structures and code so that all the kernel specific code is
located together. This makes it simple for userspace to ifdef out
the all the parts it does not need, minimising the code differences
between kernel and userspace. The next commit will do the move...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 16eaab83 28-Apr-2021 Brian Foster <bfoster@redhat.com>

xfs: introduce in-core global counter of allocbt blocks

Introduce an in-core counter to track the sum of all allocbt blocks
used by the filesystem. This value is currently tracked per-ag via
the ->agf_btreeblks field in the AGF, which also happens to include
rmapbt blocks. A global, in-core count of allocbt blocks is required
to identify the subset of global ->m_fdblocks that consists of
unavailable blocks currently used for allocation btrees. To support
this calculation at block reservation time, construct a similar
global counter for allocbt blocks, populate it on first read of each
AGF and update it as allocbt blocks are used and released.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 3fef46fc 22-Mar-2021 Darrick J. Wong <djwong@kernel.org>

xfs: rename the blockgc workqueue

Since we're about to start using the blockgc workqueue to dispose of
inactivated inodes, strip the "block" prefix from the name; now it's
merely the general garbage collection (gc) workqueue.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 894ecacf 22-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: parallelize block preallocation garbage collection

Split the block preallocation garbage collection work into per-AG work
items so that we can take advantage of parallelization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 9669f51d 22-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: consolidate the eofblocks and cowblocks workers

Remove the separate cowblocks work items and knob so that we can control
and run everything from a single blockgc work queue. Note that the
speculative_prealloc_lifetime sysfs knob retains its historical name
even though the functions move to prefix xfs_blockgc_*.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# f46e5a17 22-Jan-2021 Brian Foster <bfoster@redhat.com>

xfs: fold sbcount quiesce logging into log covering

xfs_log_sbcount() calls xfs_sync_sb() to sync superblock counters to
disk when lazy superblock accounting is enabled. This occurs on
unmount, freeze, and read-only (re)mount and ensures the final
values are calculated and persisted to disk before each form of
quiesce completes.

Now that log covering occurs in all of these contexts and uses the
same xfs_sync_sb() mechanism to update log state, there is no need
to log the superblock separately for any reason. Update the log
quiesce path to sync the superblock at least once for any mount
where lazy superblock accounting is enabled. If the log is already
covered, it will remain in the covered state. Otherwise, the next
sync as part of the normal covering sequence will carry the
associated superblock update with it. Remove xfs_log_sbcount() now
that it is no longer needed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>


# b3f8e08c 01-Sep-2020 Christoph Hellwig <hch@lst.de>

xfs: remove xfs_getsb

Merge xfs_getsb into its only caller, and clean that one up a little bit
as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0e8e2c63 29-Jun-2020 Dave Chinner <dchinner@redhat.com>

xfs: allow multiple reclaimers per AG

Inode reclaim will still throttle direct reclaim on the per-ag
reclaim locks. This is no longer necessary as reclaim can run
non-blocking now. Hence we can remove these locks so that we don't
arbitrarily block reclaimers just because there are more direct
reclaimers than there are AGs.

This can result in multiple reclaimers working on the same range of
an AG, but this doesn't cause any apparent issues. Optimising the
spread of concurrent reclaimers for best efficiency can be done in a
future patchset.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 02beb268 30-Apr-2020 Ira Weiny <ira.weiny@intel.com>

fs/xfs: Make DAX mount option a tri-state

As agreed upon[1]. We make the dax mount option a tri-state. '-o dax'
continues to operate the same. We add 'always', 'never', and 'inode'
(default).

[1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# fd58c62b 30-Apr-2020 Ira Weiny <ira.weiny@intel.com>

fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS

In prep for the new tri-state mount option which then introduces
XFS_MOUNT_DAX_NEVER.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# b41b46c2 20-May-2020 Dave Chinner <dchinner@redhat.com>

xfs: remove the m_active_trans counter

It's a global atomic counter, and we are hitting it at a rate of
half a million transactions a second, so it's bouncing the counter
cacheline all over the place on large machines. We don't actually
need it anymore - it used to be required because the VFS freeze code
could not track/prevent filesystem transactions that were running,
but that problem no longer exists.

Hence to remove the counter, we simply have to ensure that nothing
calls xfs_sync_sb() while we are trying to quiesce the filesytem.
That only happens if the log worker is still running when we call
xfs_quiesce_attr(). The log worker is cancelled at the end of
xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
early here and then we can remove the counter altogether.

Concurrent create, 50 million inodes, identical 16p/16GB virtual
machines on different physical hosts. Machine A has twice the CPU
cores per socket of machine B:

unpatched patched
machine A: 3m16s 2m00s
machine B: 4m04s 4m05s

Create rates:
unpatched patched
machine A: 282k+/-31k 468k+/-21k
machine B: 231k+/-8k 233k+/-11k

Concurrent rm of same 50 million inodes:

unpatched patched
machine A: 6m42s 2m33s
machine B: 4m47s 4m47s

The transaction rate on the fast machine went from just under
300k/sec to 700k/sec, which indicates just how much of a bottleneck
this atomic counter was.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# b0dff466 20-May-2020 Dave Chinner <dchinner@redhat.com>

xfs: separate read-only variables in struct xfs_mount

Seeing massive cpu usage from xfs_agino_range() on one machine;
instruction level profiles look similar to another machine running
the same workload, only one machine is consuming 10x as much CPU as
the other and going much slower. The only real difference between
the two machines is core count per socket. Both are running
identical 16p/16GB virtual machine configurations

Machine A:

25.83% [k] xfs_agino_range
12.68% [k] __xfs_dir3_data_check
6.95% [k] xfs_verify_ino
6.78% [k] xfs_dir2_data_entry_tag_p
3.56% [k] xfs_buf_find
2.31% [k] xfs_verify_dir_ino
2.02% [k] xfs_dabuf_map.constprop.0
1.65% [k] xfs_ag_block_count

And takes around 13 minutes to remove 50 million inodes.

Machine B:

13.90% [k] __pv_queued_spin_lock_slowpath
3.76% [k] do_raw_spin_lock
2.83% [k] xfs_dir3_leaf_check_int
2.75% [k] xfs_agino_range
2.51% [k] __raw_callee_save___pv_queued_spin_unlock
2.18% [k] __xfs_dir3_data_check
2.02% [k] xfs_log_commit_cil

And takes around 5m30s to remove 50 million inodes.

Suspect is cacheline contention on m_sectbb_log which is used in one
of the macros in xfs_agino_range. This is a read-only variable but
shares a cacheline with m_active_trans which is a global atomic that
gets bounced all around the machine.

The workload is trying to run hundreds of thousands of transactions
per second and hence cacheline contention will be occurring on this
atomic counter. Hence xfs_agino_range() is likely just be an
innocent bystander as the cache coherency protocol fights over the
cacheline between CPU cores and sockets.

On machine A, this rearrangement of the struct xfs_mount
results in the profile changing to:

9.77% [kernel] [k] xfs_agino_range
6.27% [kernel] [k] __xfs_dir3_data_check
5.31% [kernel] [k] __pv_queued_spin_lock_slowpath
4.54% [kernel] [k] xfs_buf_find
3.79% [kernel] [k] do_raw_spin_lock
3.39% [kernel] [k] xfs_verify_ino
2.73% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock

Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
of machine B and still runs substantially slower than it should.

Current rm -rf of 50 million files:

vanilla patched
machine A 13m20s 6m42s
machine B 5m30s 5m02s

It's an improvement, hence indicating that separation and further
optimisation of read-only global filesystem data is worthwhile, but
it clearly isn't the underlying issue causing this specific
performance degradation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# f18c9a90 20-May-2020 Dave Chinner <dchinner@redhat.com>

xfs: reduce free inode accounting overhead

Shaokun Zhang reported that XFS was using substantial CPU time in
percpu_count_sum() when running a single threaded benchmark on
a high CPU count (128p) machine from xfs_mod_ifree(). The issue
is that the filesystem is empty when the benchmark runs, so inode
allocation is running with a very low inode free count.

With the percpu counter batching, this means comparisons when the
counter is less that 128 * 256 = 32768 use the slow path of adding
up all the counters across the CPUs, and this is expensive on high
CPU count machines.

The summing in xfs_mod_ifree() is only used to fire an assert if an
underrun occurs. The error is ignored by the higher level code.
Hence this is really just debug code and we don't need to run it
on production kernels, nor do we need such debug checks to return
error values just to trigger an assert.

Finally, xfs_mod_icount/xfs_mod_ifree are only called from
xfs_trans_unreserve_and_mod_sb(), so get rid of them and just
directly call the percpu_counter_add/percpu_counter_compare
functions. The compare functions are now run only on debug builds as
they are internal to ASSERT() checks and so only compiled in when
ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y).

Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 28d84620 06-May-2020 Brian Foster <bfoster@redhat.com>

xfs: remove unused shutdown types

Both types control shutdown messaging and neither is used in the
current codebase.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Collins <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 8d6c3446 04-May-2020 Ira Weiny <ira.weiny@intel.com>

fs/xfs: Make DAX mount option a tri-state

As agreed upon[1]. We make the dax mount option a tri-state. '-o dax'
continues to operate the same. We add 'always', 'never', and 'inode'
(default).

[1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 606723d9 04-May-2020 Ira Weiny <ira.weiny@intel.com>

fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYS

In prep for the new tri-state mount option which then introduces
XFS_MOUNT_DAX_NEVER.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# f0f7a674 12-Apr-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: move inode flush to the sync workqueue

Move the inode dirty data flushing to a workqueue so that multiple
threads can take advantage of a single thread's flushing work. The
ratelimiting technique used in bdd4ee4 was not successful, because
threads that skipped the inode flush scan due to ratelimiting would
ENOSPC early, which caused occasional (but noticeable) changes in
behavior and sporadic fstest regressions.

Therefore, make all the writer threads wait on a single inode flush,
which eliminates both the stampeding hordes of flushers and the small
window in which a write could fail with ENOSPC because it lost the
ratelimit race after even another thread freed space.

Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# c6425702 27-Mar-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: ratelimit inode flush on buffered write ENOSPC

A customer reported rcu stalls and softlockup warnings on a computer
with many CPU cores and many many more IO threads trying to write to a
filesystem that is totally out of space. Subsequent analysis pointed to
the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
which causes a lot of wb_writeback_work to be queued. The writeback
worker spends so much time trying to wake the many many threads waiting
for writeback completion that it trips the softlockup detector, and (in
this case) the system automatically reboots.

In addition, they complain that the lengthy xfs_flush_inodes scan traps
all of those threads in uninterruptible sleep, which hampers their
ability to kill the program or do anything else to escape the situation.

If there's thousands of threads trying to write to files on a full
filesystem, each of those threads will start separate copies of the
inode flush scan. This is kind of pointless since we only need one
scan, so rate limit the inode flush.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# a55cefcc 12-Nov-2019 Eric Sandeen <sandeen@sandeen.net>

xfs: remove unused structure members & simple typedefs

Remove some unused typedef'd simple types, and some unused
structure members.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# d8d11fc7 11-Nov-2019 Christoph Hellwig <hch@lst.de>

xfs: devirtualize ->m_dirnameops

Instead of causing a relatively expensive indirect call for each
hashing and comparism of a file name in a directory just use an
inline function and a simple branch on the ASCII CI bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix unused variable warning]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 537dabcf 11-Nov-2019 Christoph Hellwig <hch@lst.de>

xfs: remove the unused m_chsize field

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 957ee13e 08-Nov-2019 Christoph Hellwig <hch@lst.de>

xfs: remove the now unused dir ops infrastructure

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3b344413 08-Nov-2019 Christoph Hellwig <hch@lst.de>

xfs: move the node header size to struct xfs_da_geometry

Move the node header size field to struct xfs_da_geometry, and remove
the now unused non-directory dir ops infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# e1d3d218 04-Nov-2019 Ian Kent <raven@themaw.net>

xfs: use super s_id instead of struct xfs_mount m_fsname

Eliminate struct xfs_mount field m_fsname by using the super block s_id
field directly.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# f676c750 04-Nov-2019 Ian Kent <raven@themaw.net>

xfs: remove unused struct xfs_mount field m_fsname_len

The struct xfs_mount field m_fsname_len is not used anywhere, remove it.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 7c6b94b1 28-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: reverse the polarity of XFS_MOUNT_COMPAT_IOSIZE

Replace XFS_MOUNT_COMPAT_IOSIZE with an inverted XFS_MOUNT_LARGEIO flag
that makes the usage more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3274d008 28-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: rename the XFS_MOUNT_DFLT_IOSIZE option to

Make the flag match the mount option and usage.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 2fcddee8 28-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: simplify parsing of allocsize mount option

Rework xfs_parseargs to fill out the default value and then parse the
option directly into the mount structure, similar to what we do for
other updates, and open code the now trivial updates based on on the
on-disk superblock directly into xfs_mountfs.

Note that this change rejects the allocsize=0 mount option that has been
documented as invalid for a long time instead of just ignoring it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 5da8a07c 28-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: rename the m_writeio_* fields in struct xfs_mount

Use the allocsize name to match the mount option and usage instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3cd1d18b 28-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: remove the m_readio_* fields in struct xfs_mount

m_readio_blocks is entirely unused, and m_readio_blocks is only used in
xfs_stat_blksize in a max statements that is a no-op as it always has
the same value as m_writeio_log.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# b5ad616c 28-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: don't use a different allocsice for -o wsync

The -o wsync allocsize overwrite overwrite was part of a special hack
for NFSv2 servers in IRIX and has no real purpose in modern Linux, so
remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# dd2d535e 28-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: cleanup calculating the stat optimal I/O size

Move xfs_preferred_iosize to xfs_iops.c, unobsfucate it and also handle
the realtime special case in the helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0ad95687 26-Aug-2019 Dave Chinner <dchinner@redhat.com>

xfs: add kmem allocation trace points

When trying to correlate XFS kernel allocations to memory reclaim
behaviour, it is useful to know what allocations XFS is actually
attempting. This information is not directly available from
tracepoints in the generic memory allocation and reclaim
tracepoints, so these new trace points provide a high level
indication of what the XFS memory demand actually is.

There is no per-filesystem context in this code, so we just trace
the type of allocation, the size and the allocation constraints.
The kmem code also doesn't include much of the common XFS headers,
so there are a few definitions that need to be added to the trace
headers and a couple of types that need to be made common to avoid
needing to include the whole world in the kmem code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 1058d0f5 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: move the log ioend workqueue to struct xlog

Move the workqueue used for log I/O completions from struct xfs_mount
to struct xlog to keep it self contained in the log code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: destroy the log workqueue after ensuring log ios are done]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 8c9ce2f7 12-Jun-2019 Eric Sandeen <sandeen@sandeen.net>

xfs: remove unused flags arg from getsb interfaces

The flags value is always passed as 0 so remove the argument.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# ef325959 05-Jun-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: separate inode geometry

Separate the inode geometry information into a distinct structure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 9fe82b8c 25-Apr-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: track delayed allocation reservations across the filesystem

Add a percpu counter to track the number of blocks directly reserved for
delayed allocations on the data device. This counter (in contrast to
i_delayed_blks) does not track allocated CoW staging extents or anything
going on with the realtime device. It will be used in the upcoming
summary counter scrub function to check the free block counts without
having to freeze the filesystem or walk all the inodes to find the
delayed allocations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 28408243 15-Apr-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove unused m_data_workqueue

Now that we're no longer using m_data_workqueue, remove it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 39353ff6 12-Apr-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: replace the BAD_SUMMARY mount flag with the equivalent health code

Replace the BAD_SUMMARY mount flag with calls to the equivalent health
tracking code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 6772c1f1 12-Apr-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: track metadata health status

Add the necessary in-core metadata fields to keep track of which parts
of the filesystem have been observed and which parts were observed to be
unhealthy, and print a warning at unmount time if we have unfixed
problems.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 66ae56a5 18-Feb-2019 Christoph Hellwig <hch@lst.de>

xfs: introduce an always_cow mode

Add a mode where XFS never overwrites existing blocks in place. This
is to aid debugging our COW code, and also put infatructure in place
for things like possible future support for zoned block devices, which
can't support overwrites.

This mode is enabled globally by doing a:

echo 1 > /sys/fs/xfs/debug/always_cow

Note that the parameter is global to allow running all tests in xfstests
easily in this mode, which would not easily be possible with a per-fs
sysfs file.

In always_cow mode persistent preallocations are disabled, and fallocate
will fail when called with a 0 mode (with our without
FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space
when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.

There are a few interesting xfstests failures when run in always_cow
mode:

- generic/392 fails because the bytes used in the file used to test
hole punch recovery are less after the log replay. This is
because the blocks written and then punched out are only freed
with a delay due to the logging mechanism.
- xfs/170 will fail as the already fragile file streams mechanism
doesn't seem to interact well with the COW allocator
- xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
the file system is badly fragmented, but there is not much we
can do to avoid that when always writing out of place
- xfs/205 fails because overwriting a file in always_cow mode
will require new space allocation and the assumption in the
test thus don't work anymore.
- xfs/326 fails to modify the file at all in always_cow mode after
injecting the refcount error, leading to an unexpected md5sum
after the remount, but that again is expected

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# e1f6ca11 14-Feb-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: rename m_inotbt_nores to m_finobt_nores

Rename this flag variable to imply more strongly that it's related to
the free inode btree (finobt) operation. No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 9b247179 07-Feb-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: cache unlinked pointers in an rhashtable

Use a rhashtable to cache the unlinked list incore. This should speed
up unlinked processing considerably when there are a lot of inodes on
the unlinked list because iunlink_remove no longer has to traverse an
entire bucket list to find which inode points to the one being removed.

The incore list structure records "X.next_unlinked = Y" relations, with
the rhashtable using Y to index the records. This makes finding the
inode X that points to a inode Y very quick. If our cache fails to find
anything we can always fall back on the old method.

FWIW this drastically reduces the amount of time it takes to remove
inodes from the unlinked list. I wrote a program to open a lot of
O_TMPFILE files and then close them in the same order, which takes
a very long time if we have to traverse the unlinked lists. With the
ptach, I see:

+ /d/t/tmpfile/tmpfile
Opened 193531 files in 6.33s.
Closed 193531 files in 5.86s

real 0m12.192s
user 0m0.064s
sys 0m11.619s
+ cd /
+ umount /mnt

real 0m0.050s
user 0m0.004s
sys 0m0.030s

And without the patch:

+ /d/t/tmpfile/tmpfile
Opened 193588 files in 6.35s.
Closed 193588 files in 751.61s

real 12m38.853s
user 0m0.084s
sys 12m34.470s
+ cd /
+ umount /mnt

real 0m0.086s
user 0m0.000s
sys 0m0.060s

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 355e3532 12-Dec-2018 Omar Sandoval <osandov@fb.com>

xfs: cache minimum realtime summary level

The realtime summary is a two-dimensional array on disk, effectively:

u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]

rsum[log][bbno] is the number of extents of size 2**log which start in
bitmap block bbno.

xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
rsum[log][bbno] != 0 for any log level. However, the summary array is
stored in row-major order (i.e., like an array in C), so all of these
entries are not adjacent, but rather spread across the entire summary
file. In the worst case (a full bitmap block), xfs_rtany_summary() has
to check every level.

This means that on a moderately-used realtime device, an allocation will
waste a lot of time finding, reading, and releasing buffers for the
realtime summary. In particular, one of our storage services (which runs
on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
xfs_trans_brelse() called from xfs_rtany_summary().

One solution would be to also store the summary with the dimensions
swapped. However, this would require a disk format change to a very old
component of XFS.

Instead, we can cache the minimum size which contains any extents. We do
so lazily; rather than guaranteeing that the cache contains the precise
minimum, it always contains a loose lower bound which we tighten when we
read or update a summary block. This only uses a few kilobytes of memory
and is already serialized via the realtime bitmap and summary inode
locks, so the cost is minimal. With this change, the same workload only
spends 0.2% of its CPU cycles in the realtime allocator.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# c1b4a321 12-Dec-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: precalculate cluster alignment in inodes and blocks

Store the inode cluster alignment information in units of inodes and
blocks in the mount data so that we don't have to keep recalculating
them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 83dcdb44 12-Dec-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: precalculate inodes and blocks per inode cluster

Store the number of inodes and blocks per inode cluster in the mount
data so that we don't have to keep recalculating them.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 1c02d502 26-Jul-2018 Eric Sandeen <sandeen@sandeen.net>

xfs: remove deprecated barrier/nobarrier mount

The barrier mount options have been no-ops and deprecated since

4cf4573 xfs: deprecate barrier/nobarrier mount option

i.e. kernel 4.10 / December 2016, with a stated deprecation schedule
after v4.15. Should be fair game to remove them now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# f467cad9 20-Jul-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: force summary counter recalc at next mount

Use the "bad summary count" mount flag from the previous patch to skip
writing the unmount record to force log recovery at the next mount,
which will recalculate the summary counters for us.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 2e9e6481 19-Jul-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: detect and fix bad summary counts at mount

Filippo Giunchedi complained that xfs doesn't even perform basic sanity
checks of the fs summary counters at mount time. Therefore, recalculate
the summary counters from the AGFs after log recovery if the counts were
bad (or we had to recover the fs). Enhance the recalculation routine to
fail the mount entirely if the new values are also obviously incorrect.

We use a mount state flag to record the "bad summary count" state so
that the (subsequent) online fsck patches can detect subtlely incorrect
counts and set the flag; clear it userspace asks for a repair; or force
a recalculation at the next mount if nobody fixes it by unmount time.

Reported-by: Filippo Giunchedi <fgiunchedi@wikimedia.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 9bb54cb5 07-Jun-2018 Dave Chinner <dchinner@redhat.com>

xfs: clean up MIN/MAX

Get rid of the MIN/MAX macros and just use the native min/max macros
directly in the XFS code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0b61f8a4 05-Jun-2018 Dave Chinner <dchinner@redhat.com>

xfs: convert to SPDX license tags

Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/

This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:

for f in `git grep -l "GNU General" fs/xfs/` ; do
echo $f
cat $f | awk -f hdr.awk > $f.new
mv -f $f.new $f
done

And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:

$ cat hdr.awk
BEGIN {
hdr = 1.0
tag = "GPL-2.0"
str = ""
}

/^ \* This program is free software/ {
hdr = 2.0;
next
}

/any later version./ {
tag = "GPL-2.0+"
next
}

/^ \*\// {
if (hdr > 0.0) {
print "// SPDX-License-Identifier: " tag
print str
print $0
str=""
hdr = 0.0
next
}
print $0
next
}

/^ \* / {
if (hdr > 1.0)
next
if (hdr > 0.0) {
if (str != "")
str = str "\n"
str = str $0
next
}
print $0
next
}

/^ \*/ {
if (hdr > 0.0)
next
print $0
next
}

// {
if (hdr > 0.0) {
if (str != "")
str = str "\n"
str = str $0
next
}
print $0
}

END { }
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# a27ba260 15-Mar-2018 Brian Foster <bfoster@redhat.com>

xfs: detect agfl count corruption and reset agfl

The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.

This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.

Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.

Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.

This allows kernels that include the packing fix commit 96f859d52bcb
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0ab32086 09-Mar-2018 Brian Foster <bfoster@redhat.com>

xfs: account only rmapbt-used blocks against rmapbt perag res

The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.

While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure leads to a growing inconsistency over time
between the runtime tracking of rmapbt usage vs. actual rmapbt
usage. Since the runtime tracking thinks all agfl blocks are rmapbt
blocks, it essentially believes that less future reservation is
required to satisfy the rmapbt than what is actually necessary.

The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observe the difference in the agfl reservation requirement across an
unmount/mount cycle:

mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
...
... : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194

As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs. Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.

To address this divergence, update the RMAPBT reservation to account
blocks used for the rmapbt only rather than all blocks filled into
the agfl. This patch makes several high-level changes toward that
end:

1.) Reintroduce an AGFL reservation type to serve as an accounting
no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
allocation path rather than the AGFL allocation path.

The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.

The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 21592863 09-Mar-2018 Brian Foster <bfoster@redhat.com>

xfs: rename agfl perag res type to rmapbt

The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.

This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.

Note that this patch does not change perag reservation behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 4603fa74 06-Mar-2018 Eric Sandeen <sandeen@redhat.com>

xfs: remove unused m_dmevmask from xfs_mount struct

The dmevmask structure member is a dmapi leftover; it's
set here and there but never actually used. Remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# f8c47250 20-Jun-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: convert drop_writes to use the errortag mechanism

We now have enhanced error injection that can control the frequency
with which errors happen, so convert drop_writes to use this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>


# c6840101 20-Jun-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: expose errortag knobs via sysfs

Creates a /sys/fs/xfs/$dev/errortag/ directory to control the errortag
values directly. This enables us to control the randomness values,
rather than having to accept the defaults.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>


# 31965ef3 20-Jun-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: make errortag a per-mountpoint structure

Remove the xfs_etest structure in favor of a per-mountpoint structure.
This will give us the flexibility to set as many error injection points
as we want, and later enable us to set up sysfs knobs to set the trigger
frequency as we wish. This comes at a cost of higher memory use, but
unti we hit 1024 injection points (we're at 29) or a lot of mounts this
shouldn't be a huge issue.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>


# c8ce540d 16-Jun-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove double-underscore integer types

This is a purely mechanical patch that removes the private
__{u,}int{8,16,32,64}_t typedefs in favor of using the system
{u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
the transformation and fix the resulting whitespace and indentation
errors:

s/typedef\t__uint8_t/typedef __uint8_t\t/g
s/typedef\t__uint/typedef __uint/g
s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
s/__uint8_t\t/__uint8_t\t\t/g
s/__uint/uint/g
s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
s/__int/int/g
/^typedef.*int[0-9]*_t;$/d

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 4f1adf33 19-Apr-2017 Eric Sandeen <sandeen@redhat.com>

xfs: more do_div cleanups

On some architectures do_div does the pointer compare
trick to make sure that we've sent it an unsigned 64-bit
number. (Why unsigned? I don't know.)

Fix up the few places that squawk about this; in
xfs_bmap_wants_extents() we just used a bare int64_t so change
that to unsigned.

In xfs_adjust_extent_unmap_boundaries() all we wanted was the
mod, and we have an xfs-specific function to handle that w/o
side effects, which includes proper casting for do_div.

In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
XFS_BB_TO_FSBT returns a block in the filesystem, so use
xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
from that type as a bonus.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 696a5620 28-Mar-2017 Brian Foster <bfoster@redhat.com>

xfs: use dedicated log worker wq to avoid deadlock with cil wq

The log covering background task used to be part of the xfssyncd
workqueue. That workqueue was removed as of commit 5889608df ("xfs:
syncd workqueue is no more") and the associated work item scheduled
to the xfs-log wq. The latter is used for log buffer I/O completion.

Since xfs_log_worker() can invoke a log flush, a deadlock is
possible between the xfs-log and xfs-cil workqueues. Consider the
following codepath from xfs_log_worker():

xfs_log_worker()
xfs_log_force()
_xfs_log_force()
xlog_cil_force()
xlog_cil_force_lsn()
xlog_cil_push_now()
flush_work()

The above is in xfs-log wq context and blocked waiting on the
completion of an xfs-cil work item. Concurrently, the cil push in
progress can end up blocked here:

xlog_cil_push_work()
xlog_cil_push()
xlog_write()
xlog_state_get_iclog_space()
xlog_wait(&log->l_flush_wait, ...)

The above is in xfs-cil context waiting on log buffer I/O
completion, which executes in xfs-log wq context. In this scenario
both workqueues are deadlocked waiting on eachother.

Add a new workqueue specifically for the high level log covering and
ail pushing worker, as was the case prior to commit 5889608df.

Diagnosed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 9dbddd7b 13-Feb-2017 Brian Foster <bfoster@redhat.com>

xfs: resurrect debug mode drop buffered writes mechanism

A debug mode write failure mechanism was introduced to XFS in commit
801cc4e17a ("xfs: debug mode forced buffered write failure") to
facilitate targeted testing of delalloc indirect reservation management
from userspace. This code was subsequently rendered ineffective by the
move to iomap based buffered writes in commit 68a9f5e700 ("xfs:
implement iomap based buffered write path"). This likely went unnoticed
because the associated userspace code had not made it into xfstests.

Resurrect this mechanism to facilitate effective indlen reservation
testing from xfstests. The move to iomap based buffered writes relocated
the hook this mechanism needs to return write failure from XFS to
generic code. The failure trigger must remain in XFS. Given that
limitation, convert this from a write failure mechanism to one that
simply drops writes without returning failure to userspace. Rename all
"fail_writes" references to "drop_writes" to illustrate the point. This
is more hacky than preferred, but still triggers the XFS error handling
behavior required to drive the indlen tests. This is only available in
DEBUG mode and for testing purposes only.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# ebf55872 07-Feb-2017 Christoph Hellwig <hch@lst.de>

xfs: improve handling of busy extents in the low-level allocator

Currently we force the log and simply try again if we hit a busy extent,
but especially with online discard enabled it might take a while after
the log force for the busy extents to disappear, and we might have
already completed our second pass.

So instead we add a new waitqueue and a generation counter to the pag
structure so that we can do wakeups once we've removed busy extents,
and we replace the single retry with an unconditional one - after
all we hold the AGF buffer lock, so no other allocations or frees
can be racing with us in this AG.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 76d771b4 25-Jan-2017 Christoph Hellwig <hch@lst.de>

xfs: use per-AG reservations for the finobt

Currently we try to rely on the global reserved block pool for block
allocations for the free inode btree, but I have customer reports
(fairly complex workload, need to find an easier reproducer) where that
is not enough as the AG where we free an inode that requires a new
finobt block is entirely full. This causes us to cancel a dirty
transaction and thus a file system shutdown.

I think the right way to guard against this is to treat the finot the same
way as the refcount btree and have a per-AG reservations for the possible
worst case size of it, and the patch below implements that.

Note that this could increase mount times with large finobt trees. In
an ideal world we would have added a field for the number of finobt
fields to the AGI, similar to what we did for the refcount blocks.
We should do add it next time we rev the AGI or AGF format by adding
new fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 6031e73a 06-Dec-2016 Lucas Stach <dev@lynxeye.de>

xfs: use rhashtable to track buffer cache

On filesystems with a lot of metadata and in metadata intensive workloads
xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
the CPU time is spent on CPU cache misses while traversing the rbtree.

As the buffer cache does not need any kind of ordering, but fast lookups
a hashtable is the natural data structure to use. The rhashtable
infrastructure provides a self-scaling hashtable implementation and
allows lookups to proceed while the table is going through a resize
operation.

This reduces the CPU-time spent for the lookups to 1/3 even for small
filesystems with a relatively small number of cached buffers, with
possibly much larger gains on higher loaded filesystems.

[dchinner: reduce minimum hash size to an acceptable size for large
filesystems with many AGs with no active use.]
[dchinner: remove stale rbtree asserts.]
[dchinner: use xfs_buf_map for compare function argument.]
[dchinner: make functions static.]
[dchinner: remove redundant comments.]

Signed-off-by: Lucas Stach <dev@lynxeye.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 83104d44 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: garbage collect old cowextsz reservations

Trim CoW reservations made on behalf of a cowextsz hint if they get too
old or we run low on quota, so long as we don't have dirty data awaiting
writeback or directio operations in progress.

Garbage collection of the cowextsize extents are kept separate from
prealloc extent reaping because setting the CoW prealloc lifetime to a
(much) higher value than the regular prealloc extent lifetime has been
useful for combatting CoW fragmentation on VM hosts where the VMs
experience bursty write behaviors and we can keep the utilization ratios
low enough that we don't start to run out of space. IOWs, it benefits
us to keep the CoW fork reservations around for as long as we can unless
we run out of blocks or hit inode reclaim.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 1946b91c 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: define the on-disk refcount btree format

Start constructing the refcount btree implementation by establishing
the on-disk format and everything needed to read, write, and
manipulate the refcount btree blocks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 46eeb521 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: introduce refcount btree definitions

Add new per-AG refcount btree definitions to the per-AG structures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 3fd129b6 18-Sep-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: set up per-AG free space reservations

One unfortunate quirk of the reference count and reverse mapping
btrees -- they can expand in size when blocks are written to *other*
allocation groups if, say, one large extent becomes a lot of tiny
extents. Since we don't want to start throwing errors in the middle
of CoWing, we need to reserve some blocks to handle future expansion.
The transaction block reservation counters aren't sufficient here
because we have to have a reserve of blocks in every AG, not just
somewhere in the filesystem.

Therefore, create two per-AG block reservation pools. One feeds the
AGFL so that rmapbt expansion always succeeds, and the other feeds all
other metadata so that refcountbt expansion never fails.

Use the count of how many reserved blocks we need to have on hand to
create a virtual reservation in the AG. Through selective clamping of
the maximum length of allocation requests and of the length of the
longest free extent, we can make it look like there's less free space
in the AG unless the reservation owner is asking for blocks.

In other words, play some accounting tricks in-core to make sure that
we always have blocks available. On the plus side, there's nothing to
clean up if we crash, which is contrast to the strategy that the rough
draft used (actually removing extents from the freespace btrees).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 77169812 13-Sep-2016 Eric Sandeen <sandeen@redhat.com>

xfs: normalize "infinite" retries in error configs

As it stands today, the "fail immediately" vs. "retry forever"
values for max_retries and retry_timeout_seconds in the xfs metadata
error configurations are not consistent.

A retry_timeout_seconds of 0 means "retry forever," but a
max_retries of 0 means "fail immediately."

retry_timeout_seconds < 0 is disallowed, while max_retries == -1
means "retry forever."

Make this consistent across the error configs, such that a value of
0 means "fail immediately" (i.e. wait 0 seconds, or retry 0 times),
and a value of -1 always means "retry forever."

This makes retry_timeout a signed long to accommodate the -1, even
though it stores jiffies. Given our limit of a 1 day maximum
timeout, this should be sufficient even at much higher HZ values
than we have available today.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 52548852 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: rmap btree requires more reserved free space

Originally-From: Dave Chinner <dchinner@redhat.com>

The rmap btree is allocated from the AGFL, which means we have to
ensure ENOSPC is reported to userspace before we run out of free
space in each AG. The last allocation in an AG can cause a full
height rmap btree split, and that means we have to reserve at least
this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
Update the various space calculation functions to handle this.

Also, because the macros are now executing conditional code and are
called quite frequently, convert them to functions that initialise
variables in the struct xfs_mount, use the new variables everywhere
and document the calculations better.

[darrick.wong@oracle.com: don't reserve blocks if !rmap]
[dchinner@redhat.com: update m_ag_max_usable after growfs]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 035e00ac 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: define the on-disk rmap btree format

Originally-From: Dave Chinner <dchinner@redhat.com>

Now we have all the surrounding call infrastructure in place, we can
start filling out the rmap btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate rmap btree blocks. This prepares the way for adding the
btree operations implementation.

[darrick: record owner and offset info in rmap btree]
[darrick: fork, bmbt and unwritten state in rmap btree]
[darrick: flags are a separate field in xfs_rmap_irec]
[darrick: calculate maxlevels separately]
[darrick: move the 'unwritten' bit into unused parts of rm_offset]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 8018026e 02-Aug-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: rmap btree add more reserved blocks

Originally-From: Dave Chinner <dchinner@redhat.com>

XFS reserves a small amount of space in each AG for the minimum
number of free blocks needed for operation. Adding the rmap btree
increases the number of reserved blocks, but it also increases the
complexity of the calculation as the free inode btree is optional
(like the rmbt).

Rather than calculate the prealloc blocks every time we need to
check it, add a function to calculate it at mount time and store it
in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro
just to use the xfs-mount variable directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# e6b3bb78 17-May-2016 Carlos Maiolino <cmaiolino@redhat.com>

xfs: add "fail at unmount" error handling configuration

If we take "retry forever" literally on metadata IO errors, we can
hang at unmount, once it retries those writes forever. This is the
default behavior, unfortunately.

Add an error configuration option for this behavior and default it
to "fail" so that an unmount will trigger actuall errors, a shutdown
and allow the unmount to succeed. It will be noisy, though, as it
will log the errors and shutdown that occurs.

To fix this, we need to mark the filesystem as being in the process
of unmounting. Do this with a mount flag that is added at the
appropriate time (i.e. before the blocking AIL sync). We also need
to add this flag if mount fails after the initial phase of log
recovery has been run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# e0a431b3 17-May-2016 Carlos Maiolino <cmaiolino@redhat.com>

xfs: add configuration handlers for specific errors

now most of the infrastructure is in place, we can start adding
support for configuring specific errors such as ENODEV, ENOSPC, EIO,
etc. Add these error configurations and configure them all to have
appropriate behaviours. That is, all will be configured to retry
forever by default, except for ENODEV, which is an unrecoverable
error, so it will be configured to not retry on error

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# a5ea70d2 17-May-2016 Carlos Maiolino <cmaiolino@redhat.com>

xfs: add configuration of error failure speed

On reception of an error, we can fail immediately, perform some
bound amount of retries or retry indefinitely. The current behaviour
we have is to retry forever.

However, we'd like the ability to choose how long the filesystem
should try after an error, it can either fail immediately, retry a
few times, or retry forever. This is implemented by using
max_retries sysfs attribute, to hold the amount of times we allow
the filesystem to retry after an error. Being -1 a special case
where the filesystem will retry indefinitely.

Add both a maximum retry count and a retry timeout so that we can
bound by time and/or physical IO attempts.

Finally, plumb these into xfs_buf_iodone error processing so that
the error behaviour follows the selected configuration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# df309390 17-May-2016 Carlos Maiolino <cmaiolino@redhat.com>

xfs: add configurable error support to metadata buffers

With the error configuration handle for async metadata write errors
in place, we can now add initial support to the IO error processing
in xfs_buf_iodone_error().

Add an infrastructure function to look up the configuration handle,
and rearrange the error handling to prepare the way for different
error handling conigurations to be used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# ffd40ef6 17-May-2016 Carlos Maiolino <cmaiolino@redhat.com>

xfs: introduce metadata IO error class

Now we have the basic infrastructure, add the first error class so
we can build up the infrastructure in a meaningful way. Add the
metadata async write IO error class and sysfs entry, and introduce a
default configuration that matches the existing "retry forever"
behavior for async write metadata buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 192852be 17-May-2016 Carlos Maiolino <cmaiolino@redhat.com>

xfs: configurable error behavior via sysfs

We need to be able to change the way XFS behaviours in error
conditions depending on the type of underlying storage. This is
necessary for handling non-traditional block devices with extended
error cases, such as thin provisioned devices that can return ENOSPC
as an IO error.

Introduce the basic sysfs infrastructure needed to define and
configure error behaviours. This is done to be generic enough to
extend to configuring behaviour in other error conditions, such as
ENOMEM, which also has different desired behaviours according to
machine configuration.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 09cbfeaf 01-Apr-2016 Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized. And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special. They are
not.

The changes are pretty straight-forward:

- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

- page_cache_get() -> get_page();

- page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 801cc4e1 14-Mar-2016 Brian Foster <bfoster@redhat.com>

xfs: debug mode forced buffered write failure

Add a DEBUG mode-only sysfs knob to enable forced buffered write
failure. An additional side effect of this mode is brute force killing
of delayed allocation blocks in the range of the write. The latter is
the prime motiviation behind this patch, as userspace test
infrastructure requires a reliable mechanism to create and split
delalloc extents without causing extent conversion.

Certain fallocate operations (i.e., zero range) were used for this in
the past, but the implementations have changed such that delalloc
extents are flushed and converted to real blocks, rendering the test
useless.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 12c3f05c 01-Mar-2016 Eric Sandeen <sandeen@redhat.com>

xfs: fix up inode32/64 (re)mount handling

inode32/inode64 allocator behavior with respect to mount, remount
and growfs is a little tricky.

The inode32 mount option should only enable the inode32 allocator
heuristics if the filesystem is large enough for 64-bit inodes to
exist. Today, it has this behavior on the initial mount, but a
remount with inode32 unconditionally changes the allocation
heuristics, even for a small fs.

Also, an inode32 mounted small filesystem should transition to the
inode32 allocator if the filesystem is subsequently grown to a
sufficient size. Today that does not happen.

This patch consolidates xfs_set_inode32 and xfs_set_inode64 into a
single new function, and moves the "is the maximum inode number big
enough to matter" test into that function, so it doesn't rely on the
caller to get it right - which remount did not do, previously.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# de0b85a8 07-Feb-2016 Eric Sandeen <sandeen@redhat.com>

xfs: remove unused function definitions

Old leftovers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# af3b6382 02-Nov-2015 Darrick J. Wong <darrick.wong@oracle.com>

xfs: don't leak uuid table on rmmod

Don't leak the UUID table when the module is unloaded.
(Found with kmemleak.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 3fbbbea3 02-Nov-2015 Dave Chinner <dchinner@redhat.com>

xfs: introduce BMAPI_ZERO for allocating zeroed extents

To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can per-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 225e4635 12-Oct-2015 Bill O'Donnell <billodo@redhat.com>

xfs: per-filesystem stats in sysfs

This patch implements per-filesystem stats objects in sysfs. It
depends on the application of the previous patch series that
develops the infrastructure to support both xfs global stats and
xfs per-fs stats in sysfs.

Stats objects are instantiated when an xfs filesystem is mounted
and deleted on unmount. With this patch, the stats directory is
created and populated with the familiar stats and stats_clear files.
Example:
/sys/fs/xfs/sda9/stats/stats
/sys/fs/xfs/sda9/stats/stats_clear

With this patch, the individual counts within the new per-fs
stats file(s) remain at zero. Functions that use the the macros
to increment, decrement, and add-to the per-fs stats counts will
be covered in a separate new patch to follow this one. Note that
the counts within the global stats file (/sys/fs/xfs/stats/stats)
advance normally and can be cleared as it was prior to this patch.

[dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so
it is down before/after any path that uses the per-mount stats. ]

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# cbe4dab1 03-Jun-2015 Dave Chinner <dchinner@redhat.com>

xfs: add initial DAX support

Add initial DAX support to XFS. To do this we need a new mount
option to turn DAX on filesystem, and we need to propagate this into
the inode flags whenever an inode is instantiated so that the
per-inode checks throughout the code Do The Right Thing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 066a1884 28-May-2015 Brian Foster <bfoster@redhat.com>

xfs: use sparse chunk alignment for min. inode allocation requirement

xfs_ialloc_ag_select() iterates through the allocation groups looking
for free inodes or free space to determine whether to allow an inode
allocation to proceed. If no free inodes are available, it assumes that
an AG must have an extent longer than mp->m_ialloc_blks.

Sparse inode chunk support currently allows for allocations smaller than
the traditional inode chunk size specified in m_ialloc_blks. The current
minimum sparse allocation is set in the superblock sb_spino_align field
at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use
this to represent the minimum supported allocation size for inode
chunks. Initialize m_ialloc_min_blks at mount time based on whether
sparse inodes are supported.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 964aa8d9 23-Feb-2015 Dave Chinner <dchinner@redhat.com>

xfs: remove xfs_mod_incore_sb API

Now that there are no users of the bitfield based incore superblock
modification API, just remove the whole damn lot of it, including
all the bitfield definitions. This finally removes a lot of cruft
that has been around for a long time.

Credit goes to Christoph Hellwig for providing a great patch
connecting all the dots to enale us to do this. This patch is
derived from that work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 0bd5dded 23-Feb-2015 Dave Chinner <dchinner@redhat.com>

xfs: replace xfs_mod_incore_sb_batched

Introduce helper functions for modifying fields in the superblock
into xfs_trans.c, the only caller of xfs_mod_incore_sb_batch(). We
can then use these directly in xfs_trans_unreserve_and_mod_sb() and
so remove another user of the xfs_mode_incore_sb() API without
losing any functionality or scalability of the transaction commit
code..

Based on a patch from Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# bab98bbe 23-Feb-2015 Dave Chinner <dchinner@redhat.com>

xfs: introduce xfs_mod_frextents

Add a new helper to modify the incore counter of free realtime
extents. This matches the helpers used for inode and data block
counters, and removes a significant users of the xfs_mod_incore_sb()
interface.

Based on a patch originally from Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 5681ca40 23-Feb-2015 Dave Chinner <dchinner@redhat.com>

xfs: Remove icsb infrastructure

Now that the in-core superblock infrastructure has been replaced with
generic per-cpu counters, we don't need it anymore. Nuke it from
orbit so we are sure that it won't haunt us again...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 0d485ada 23-Feb-2015 Dave Chinner <david@fromorbit.com>

xfs: use generic percpu counters for free block counter

XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. The free block counter is
special in that it is used for ENOSPC detection outside transaction
contexts for for delayed allocation. This means that the counter
needs to be accurate at zero. The current per-cpu counter code jumps
through lots of hoops to ensure we never run past zero, but we don't
need to make all those jumps with the generic counter
implementation.

The generic counter implementation allows us to pass a "batch"
threshold at which the addition/subtraction to the counter value
will be folded back into global value under lock. We can use this
feature to reduce the batch size as we approach 0 in a very similar
manner to the existing counters and their rebalance algorithm. If we
use a batch size of 1 as we approach 0, then every addition and
subtraction will be done against the global value and hence allow
accurate detection of zero threshold crossing.

Hence we can replace the handrolled, accurate-at-zero counters with
generic percpu counters.

Note: this removes just enough of the icsb infrastructure to compile
without warnings. The rest will go in subsequent commits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# e88b64ea 23-Feb-2015 Dave Chinner <dchinner@redhat.com>

xfs: use generic percpu counters for free inode counter

XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. The free inode counter is not
used for any limit enforcement - the per-AG free inode counters are
used during allocation to determine if there are inode available for
allocation.

Hence we don't need any of the complexity of the hand-rolled
counters and we can simply replace them with generic per-cpu
counters similar to the inode counter.

This version introduces a xfs_mod_ifree() helper function from
Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 501ab323 23-Feb-2015 Dave Chinner <david@fromorbit.com>

xfs: use generic percpu counters for inode counter

XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. There are some warts around
the use of them for the inode counter as the hand rolled counter is
designed to be accurate at zero, but has no specific accurracy at
any other value. This design causes problems for the maximum inode
count threshold enforcement, as there is no trigger that balances
the counters as they get close tothe maximum threshold.

Instead of designing new triggers for balancing, just replace the
handrolled per-cpu counter with a generic counter. This enables us
to update the counter through the normal superblock modification
funtions, but rather than do that we add a xfs_mod_icount() helper
function (from Christoph Hellwig) and keep the percpu counter
outside the superblock in the struct xfs_mount.

This means we still need to initialise the per-cpu counter
specifically when we read the superblock, and vice versa when we
log/write it, but it does mean that we don't need to change any
other code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 52785112 15-Feb-2015 Christoph Hellwig <hch@lst.de>

xfs: implement pNFS export operations

Add operations to export pNFS block layouts from an XFS filesystem. See
the previous commit adding the operations for an explanation of them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 61e63ecb 21-Jan-2015 Dave Chinner <dchinner@redhat.com>

xfs: consolidate superblock logging functions

We now have several superblock loggin functions that are identical
except for the transaction reservation and whether it shoul dbe a
synchronous transaction or not. Consolidate these all into a single
function, a single reserveration and a sync flag and call it
xfs_sync_sb().

Also, xfs_mod_sb() is not really a modification function - it's the
operation of logging the superblock buffer. hence change the name of
it to reflect this.

Note that we have to change the mp->m_update_flags that are passed
around at mount time to a boolean simply to indicate a superblock
update is needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 4d11a402 21-Jan-2015 Dave Chinner <dchinner@redhat.com>

xfs: remove bitfield based superblock updates

When we log changes to the superblock, we first have to write them
to the on-disk buffer, and then log that. Right now we have a
complex bitfield based arrangement to only write the modified field
to the buffer before we log it.

This used to be necessary as a performance optimisation because we
logged the superblock buffer in every extent or inode allocation or
freeing, and so performance was extremely important. We haven't done
this for years, however, ever since the lazy superblock counters
pulled the superblock logging out of the transaction commit
fast path.

Hence we have a bunch of complexity that is not necessary that makes
writing the in-core superblock to disk much more complex than it
needs to be. We only need to log the superblock now during
management operations (e.g. during mount, unmount or quota control
operations) so it is not a performance critical path anymore.

As such, remove the complex field based logging mechanism and
replace it with a simple conversion function similar to what we use
for all other on-disk structures.

This means we always log the entirity of the superblock, but again
because we rarely modify the superblock this is not an issue for log
bandwidth or CPU time. Indeed, if we do log the superblock
frequently, delayed logging will minimise the impact of this
overhead.

[Fixed gquota/pquota inode sharing regression noticed by bfoster.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 4fb6e8ad 27-Nov-2014 Christoph Hellwig <hch@lst.de>

xfs: merge xfs_ag.h into xfs_format.h

More on-disk format consolidation. A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 91ee575f 27-Nov-2014 Brian Foster <bfoster@redhat.com>

xfs: allow lazy sb counter sync during filesystem freeze sequence

The expectation since the introduction the lazy superblock counters is
that the counters are synced and superblock logged appropriately as part
of the filesystem freeze sequence. This does not occur, however, due to
the logic in xfs_fs_writable() that prevents progress when the fs is in
any state other than SB_UNFROZEN.

While this is a bug, it has not been exposed to date because the last
thing XFS does during freeze is dirty the log. The log recovery process
recalculates the counters from AGI/AGF metadata to ensure everything is
correct. Therefore should a crash occur while an fs is frozen, the
subsequent log recovery puts everything back in order. See the following
commit for reference:

92821e2b [XFS] Lazy Superblock Counters

We might not always want to rely on dirtying the log on a frozen fs.
Modify xfs_log_sbcount() to proceed when the filesystem is freezing but
not once the freeze process has completed. Modify xfs_fs_writable() to
accept the minimum freeze level for which modifications should be
blocked to support various codepaths.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 78c931b8 27-Nov-2014 Brian Foster <bfoster@redhat.com>

xfs: replace global xfslogd wq with per-mount wq

The xfslogd workqueue is a global, single-job workqueue for buffer ioend
processing. This means we allow for a single work item at a time for all
possible XFS mounts on a system. fsstress testing in loopback XFS over
XFS configurations has reproduced xfslogd deadlocks due to the single
threaded nature of the queue and dependencies introduced between the
separate XFS instances by online discard (-o discard).

Discard over a loopback device converts the discard request to a hole
punch (fallocate) on the underlying file. Online discard requests are
issued synchronously and from xfslogd context in XFS, hence the xfslogd
workqueue is blocked in the upper fs waiting on a hole punch request to
be servied in the lower fs. If the lower fs issues I/O that depends on
xfslogd to complete, both filesystems end up hung indefinitely. This is
reproduced reliabily by generic/013 on XFS->loop->XFS test devices with
the '-o discard' mount option.

Further, docker implementations appear to use this kind of configuration
for container instance filesystems by default (container fs->dm->
loop->base fs) and therefore are subject to this deadlock when running
on XFS.

Replace the global xfslogd workqueue with a per-mount variant. This
guarantees each mount access to a single worker and prevents deadlocks
due to inter-fs dependencies introduced by discard. Since the queue is
only responsible for buffer iodone processing at this point in time,
rename xfslogd to xfs-buf.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# a31b1d3d 14-Jul-2014 Brian Foster <bfoster@redhat.com>

xfs: add xfs_mount sysfs kobject

Embed a base kobject into xfs_mount. This creates a kobject associated
with each XFS mount and a subdirectory in sysfs with the name of the
filesystem. The subdirectory lifecycle matches that of the mount. Also
add the new xfs_sysfs.[c,h] source files with some XFS sysfs
infrastructure to facilitate attribute creation.

Note that there are currently no attributes exported as part of the
xfs_mount kobject. It exists solely to serve as a per-mount container
for child objects.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 7ab610f9 05-Jun-2014 Dave Chinner <dchinner@redhat.com>

xfs: move node entry counts to xfs_da_geometry

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# ed358c00 05-Jun-2014 Dave Chinner <dchinner@redhat.com>

xfs: convert dir/attr btree threshold to xfs_da_geometry

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 8f66193c 05-Jun-2014 Dave Chinner <dchinner@redhat.com>

xfs: convert m_dirblksize to xfs_da_geometry

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# d6cf1305 05-Jun-2014 Dave Chinner <dchinner@redhat.com>

xfs: convert m_dirblkfsbs to xfs_da_geometry

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 7dda6e86 05-Jun-2014 Dave Chinner <dchinner@redhat.com>

xfs: convert directory segment limits to xfs_da_geometry

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 0650b554 05-Jun-2014 Dave Chinner <dchinner@redhat.com>

xfs: introduce directory geometry structure

The directory code has a dependency on the struct xfs_mount to
supply the directory block geometry. Block size, block log size,
and other parameters are pre-caclulated in the struct xfs_mount or
access directly from the superblock embedded in the struct
xfs_mount.

Extract all of this geometry information out of the struct xfs_mount
and superblock and place it into a new struct xfs_da_geometry
defined by the directory code. Allocate and initialise it at mount
time, and attach it to the struct xfs_mount so it canbe passed back
into the directory code appropriately rather than using the struct
xfs_mount.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 8f80587b 31-Oct-2013 Dave Chinner <dchinner@redhat.com>

xfs: increase inode cluster size for v5 filesystems

v5 filesystems use 512 byte inodes as a minimum, so read inodes in
clusters that are effectively half the size of a v4 filesystem with
256 byte inodes. For v5 fielsystems, scale the inode cluster size
with the size of the inode so that we keep a constant 32 inodes per
cluster ratio for all inode IO.

This only works if mkfs.xfs sets the inode alignment appropriately
for larger inode clusters, so this functionality is made conditional
on mkfs doing the right thing. xfs_repair needs to know about
the inode alignment changes, too.

Wall time:
create bulkstat find+stat ls -R unlink
v4 237s 161s 173s 201s 299s
v5 235s 163s 205s 31s 356s
patched 234s 160s 182s 29s 317s

System time:
create bulkstat find+stat ls -R unlink
v4 2601s 2490s 1653s 1656s 2960s
v5 2637s 2497s 1681s 20s 3216s
patched 2613s 2451s 1658s 20s 3007s

So, wall time same or down across the board, system time same or
down across the board, and cache hit rates all improve except for
the ls -R case which is a pure cold cache directory read workload
on v5 filesystems...

So, this patch removes most of the performance and CPU usage
differential between v4 and v5 filesystems on traversal related
workloads.

Note: while this patch is currently for v5 filesystems only, there
is no reason it can't be ported back to v4 filesystems. This hasn't
been done here because bringing the code back to v4 requires
forwards and backwards kernel compatibility testing. i.e. to
deterine if older kernels(*) do the right thing with larger inode
alignments but still only using 8k inode cluster sizes. None of this
testing and validation on v4 filesystems has been done, so for the
moment larger inode clusters is limited to v5 superblocks.

(*) a current default config v4 filesystem should mount just fine on
2.6.23 (when lazy-count support was introduced), and so if we change
the alignment emitted by mkfs without a feature bit then we have to
make sure it works properly on all kernels since 2.6.23. And if we
allow it to be changed when the lazy-count bit is not set, then it's
all kernels since v2 logs were introduced that need to be tested for
compatibility...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 4bceb18f 29-Oct-2013 Dave Chinner <david@fromorbit.com>

xfs: vectorise DA btree operations

The remaining non-vectorised code for the directory structure is the
node format blocks. This is shared with the attribute tree, and so
is slightly more complex to vectorise.

Introduce a "non-directory" directory ops structure that is attached
to all non-directory inodes so that attribute operations can be
vectorised for all inodes.

Once we do this, we can vectorise all the da btree operations.
Because this patch adds more infrastructure than it removes the
binary size does not decrease:

text data bss dec hex filename
794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig
792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1
792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2
789293 96802 1096 887191 d8997 fs/xfs/xfs.o.p3
789005 96802 1096 886903 d8997 fs/xfs/xfs.o.p4
789061 96802 1096 886959 d88af fs/xfs/xfs.o.p5
789733 96802 1096 887631 d8b4f fs/xfs/xfs.o.p6

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 32c5483a 29-Oct-2013 Dave Chinner <dchinner@redhat.com>

xfs: abstract the differences in dir2/dir3 via an ops vector

Lots of the dir code now goes through switches to determine what is
the correct on-disk format to parse. It generally involves a
"xfs_sbversion_hasfoo" check, deferencing the superblock version and
feature fields and hence touching several cache lines per operation
in the process. Some operations do multiple checks because they nest
conditional operations and they don't pass the information in a
direct fashion between each other.

Hence, add an ops vector to the xfs_inode structure that is
configured when the inode is initialised to point to all the correct
decode and encoding operations. This will significantly reduce the
branchiness and cacheline footprint of the directory object decoding
and encoding.

This is the first patch in a series of conversion patches. It will
introduce the ops structure, the setup of it and add the first
operation to the vector. Subsequent patches will convert directory
ops one at a time to keep the changes simple and obvious.

Just this patch shows the benefit of such an approach on code size.
Just converting the two shortform dir operations as this patch does
decreases the built binary size by ~1500 bytes:

$ size fs/xfs/xfs.o.orig fs/xfs/xfs.o.p1
text data bss dec hex filename
794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig
792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1
$

That's a significant decrease in the instruction cache footprint of
the directory code for such a simple change, and indicates that this
approach is definitely worth pursuing further.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 0eadd102 12-Aug-2013 Jie Liu <jeff.liu@oracle.com>

xfs: Introduce a new structure to hold transaction reservation items

Introduce a new structure xfs_trans_res to hold transaction
reservation item info per log ticket.

We also need to improve xfs_trans_resv_calc() by initializing the
log count as well as log flags for permanent log reservation.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 9356fe22 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: make struct xfs_perag kernel only

The struct xfs_perag has many kernel-only definitions in it,
requiring a __KERNEL__ guard so userspace can use it to. Move it to
xfs_mount.h so that it it kernel-only, and let userspace redefine
it's own version of the structure containing only what it needs.
This gets rid of another __KERNEL__ check in the XFS header files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# ff55068c 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: introduce xfs_sb.c for sharing with libxfs

xfs_mount.c is shared with userspace, but the only functions that
are shared are to do with physical superblock manipulations. This
means that less than 25% of the xfs_mount.c code is actually shared
with userspace. Move all the superblock functions to xfs_sb.c and
share that instead with libxfs.

Note that this will leave all the in-core transaction related
superblock counter modifications in xfs_mount.c as none of that is
shared with userspace. With a few more small changes, xfs_mount.h
won't need to be shared with userspace anymore, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 7fd36c44 12-Aug-2013 Dave Chinner <dchinner@redhat.com>

xfs: split out transaction reservation code

The transaction reservation size calculations is used by both kernel
and userspace, but most of the transaction code in xfs_trans.c is
kernel specific. Split all the transaction reservation code out into
it's own files to make sharing with userspace simpler. This just
leaves kernel-only definitions in xfs_trans.h, so it doesn't need to
be shared with userspace anymore, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 39a45d84 02-May-2013 Jie Liu <jeff.liu@oracle.com>

xfs: Remove XFS_MOUNT_RETERR

XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if
mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)"
branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so
we always be emitting a warning and returning an error.

Hence, we can remove it and get rid of a couple of redundant
check up against it at xfs_upate_alignment().

Thanks Dave Chinner for the suggestions of simplify the code
in xfs_parseargs().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 1ebdf361 02-May-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: Remove struct xfs_chash from xfs_mount

Remove struct xfs_chash from struct xfs_mount as there is no user of
it nowadays.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 04a1e6c5 02-Apr-2013 Dave Chinner <dchinner@redhat.com>

xfs: add CRC checks to the superblock

With the addition of CRCs, there is such a wide and varied change to
the on disk format that it makes sense to bump the superblock
version number rather than try to use feature bits for all the new
functionality.

This commit introduces all the new superblock fields needed for all
the new functionality: feature masks similar to ext4, separate
project quota inodes, a LSN field for recovery and the CRC field.

This commit does not bump the superblock version number, however.
That will be done as a separate commit at the end of the series
after all the new functionality is present so we switch it all on in
one commit. This means that we can slowly introduce the changes
without them being active and hence maintain bisectability of the
tree.

This patch is based on a patch originally written by myself back
from SGI days, which was subsequently modified by Christoph Hellwig.
There is relatively little of that patch remaining, but the history
of the patch still should be acknowledged here.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# d8ddfe81 11-Mar-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: Remove obsoleted m_inode_shrink from xfs_mount structure

Looks the old m_inode_shrink is obsoleted as we perform inodes reclaim per AG via
m_reclaim_workqueue, this patch remove it from the xfs_mount structure if so.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# a21cd503 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: refactor space log reservation for XFS_TRANS_ATTR_SET

Currently, we calculate the attribute set transaction
log space reservation at runtime in two parts:

1) XFS_ATTRSET_LOG_RES() which is calcuated out at mount time.

2) ((ext * (mp)->m_sb.sb_sectsize) + \
(ext * XFS_FSB_TO_B((mp), XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))) + \
(128 * (ext + (ext * XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK))))))
which is calculated out at runtime since it depend on the given extent length in blocks.

This patch renamed XFS_ATTRSET_LOG_RES(mp) to XFS_ATTRSETM_LOG_RES(mp) to indicate
that it is figured out at mount time. Introduce XFS_ATTRSETRT_LOG_RES(mp) which would
be used to calculate out the unit of the log space reservation for one block.

In this way, the total runtime space for the given extent length can be figured out by:
XFS_ATTRSETM_LOG_RES(mp) + XFS_ATTRSETRT_LOG_RES(mp) * ext

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# a7bd794a 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: introduce XFS_SB_LOG_RES() for transactions that modify sb on disk

Introduce a new transaction space reservation XFS_SB_LOG_RES() for
those transactions that need to modify the superblock on disk.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 762d7ba6 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calculate XFS_TRANS_QM_QUOTAOFF_END space log reservation at mount time

Convert the calculation for end of quotaoff log space reservation
from runtime to mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# a1bd9557 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calculate XFS_TRANS_QM_QUOTAOFF space log reservation at mount time

Convert the calculation of quota off transaction log space reservation
from runtime to mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 48001044 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calculate XFS_TRANS_QM_DQALLOC space log reservation at mount time

The disk quota allocation log space reservation is calcuated at runtime,
this patch does it at mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# f0f2df94 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calcuate XFS_TRANS_QM_SETQLIM space log reservation at mount time

For adjusting quota limits transactions, we calculate out the log space
reservation at runtime, this patch does it at mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# b0c10b98 28-Jan-2013 Jeff Liu <jeff.liu@oracle.com>

xfs: calculate XFS_TRANS_QM_SBCHANGE space log reservation at mount time

The transaction log space for clearing/reseting the quota flags
is calculated out at runtime, this patch can figure it out at
mount time.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 1813dd64 13-Nov-2012 Dave Chinner <dchinner@redhat.com>

xfs: convert buffer verifiers to an ops structure.

To separate the verifiers from iodone functions and associate read
and write verifiers at the same time, introduce a buffer verifier
operations structure to the xfs_buf.

This avoids the need for assigning the write verifier, clearing the
iodone function and re-running ioend processing in the read
verifier, and gets rid of the nasty "b_pre_io" name for the write
verifier function pointer. If we ever need to, it will also be
easier to add further content specific callbacks to a buffer with an
ops structure in place.

We also avoid needing to export verifier functions, instead we
can simply export the ops structures for those that are needed
outside the function they are defined in.

This patch also fixes a directory block readahead verifier issue
it exposed.

This patch also adds ops callbacks to the inode/alloc btree blocks
initialised by growfs. These will need more work before they will
work with CRCs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# b0f539de 13-Nov-2012 Dave Chinner <dchinner@redhat.com>

xfs: connect up write verifiers to new buffers

Metadata buffers that are read from disk have write verifiers
already attached to them, but newly allocated buffers do not. Add
appropriate write verifiers to all new metadata buffers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 98021821 12-Nov-2012 Dave Chinner <dchinner@redhat.com>

xfs: verify superblocks as they are read from disk

Add a superblock verify callback function and pass it into the
buffer read functions. Remove the now redundant verification code
that is currently in use.

Adding verification shows that secondary superblocks never have
their "sb_inprogress" flag cleared by mkfs.xfs, so when validating
the secondary superblocks during a grow operation we have to avoid
checking this field. Even if we fix mkfs, we will still have to
ignore this field for verification purposes unless a version of mkfs
that does not have this bug was used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Phil White <pwhite@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 579b62fa 06-Nov-2012 Brian Foster <bfoster@redhat.com>

xfs: add background scanning to clear eofblocks inodes

Create a new mount workqueue and delayed_work to enable background
scanning and freeing of eofblocks inodes. The scanner kicks in once
speculative preallocation occurs and stops requeueing itself when
no eofblocks inodes exist.

The scan interval is based on the new
'speculative_prealloc_lifetime' tunable (default to 5m). The
background scanner performs unfiltered, best effort scans (which
skips inodes under lock contention or with a dirty cache mapping).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 6d8b79cf 08-Oct-2012 Dave Chinner <dchinner@redhat.com>

xfs: rename xfs_sync.[ch] to xfs_icache.[ch]

xfs_sync.c now only contains inode reclaim functions and inode cache
iteration functions. It is not related to sync operations anymore.
Rename to xfs_icache.c to reflect it's contents and prepare for
consolidation with the other inode cache file that exists
(xfs_iget.c).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 5889608d 08-Oct-2012 Dave Chinner <dchinner@redhat.com>

xfs: syncd workqueue is no more

With the syncd functions moved to the log and/or removed, the syncd
workqueue is the only remaining bit left. It is used by the log
covering/ail pushing work, as well as by the inode reclaim work.

Given how cheap workqueues are these days, give the log and inode
reclaim work their own work queues and kill the syncd work queue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 9aa05000 08-Oct-2012 Dave Chinner <dchinner@redhat.com>

xfs: xfs_sync_data is redundant.

We don't do any data writeback from XFS any more - the VFS is
completely responsible for that, including for freeze. We can
replace the remaining caller with a VFS level function that
achieves the same thing, but without conflicting with current
writeback work.

This means we can remove the flush_work and xfs_flush_inodes() - the
VFS functionality completely replaces the internal flush queue for
doing this writeback work in a separate context to avoid stack
overruns.

This does have one complication - it cannot be called with page
locks held. Hence move the flushing of delalloc space when ENOSPC
occurs back up into xfs_file_aio_buffered_write when we don't hold
any locks that will stall writeback.

Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to
trigger delalloc conversion fast enough to prevent spurious ENOSPC
whent here are hundreds of writers, thousands of small files and GBs
of free RAM. Hence we need to use sync_sb_inodes() to block callers
while we wait for writeback like the previous xfs_flush_inodes
implementation did.

That means we have to hold the s_umount lock here, but because this
call can nest inside i_mutex (the parent directory in the create
case, held by the VFS), we have to use down_read_trylock() to avoid
potential deadlocks. In practice, this trylock will succeed on
almost every attempt as unmount/remount type operations are
exceedingly rare.

Note: we always need to pass a count of zero to
generic_file_buffered_write() as the previously written byte count.
We only do this by accident before this patch by the virtue of ret
always being zero when there are no errors. Make this explicit
rather than needing to specifically zero ret in the ENOSPC retry
case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# f661f1e0 08-Oct-2012 Dave Chinner <dchinner@redhat.com>

xfs: sync work is now only periodic log work

The only thing the periodic sync work does now is flush the AIL and
idle the log. These are really functions of the log code, so move
the work to xfs_log.c and rename it appropriately.

The only wart that this leaves behind is the xfssyncd_centisecs
sysctl, otherwise the xfssyncd is dead. Clean up any comments that
related to xfssyncd to reflect it's passing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 1ed845df 01-Aug-2012 Alex Elder <elder@inktank.com>

xfs: kill struct declarations in xfs_mount.h

I noticed that "struct xfs_mount_args" was still declared in
"fs/xfs/xfs_mount.h". That struct doesn't even exist any more (and
is obviously not referenced elsewhere in that header file). While
in there, delete four other unneeded struct declarations in that
file.

Doing so highlights that "fs/xfs/xfs_trace.h" was relying indirectly
on "xfs_mount.h" to be #included in order to declare "struct
xfs_bmbt_irec", so add that declaration to resolve that issue.

Signed-off-by: Alex Elder <elder@inktank.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# d9457dc0 12-Jun-2012 Jan Kara <jack@suse.cz>

xfs: Convert to new freezing code

Generic code now blocks all writers from standard write paths. So we add
blocking of all writers coming from ioctl (we get a protection of ioctl against
racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
non-racy freeze protection. We also keep freeze protection on transaction
start to block internal filesystem writes such as removal of preallocated
blocks.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# f7bdf03a 14-Jun-2012 Mark Tinguely <tinguely@sgi.com>

xfs: rename log structure to xlog

Rename the XFS log structure to xlog to help crash distinquish it from the
other logs in Linux.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# ad223e60 14-Jun-2012 Mark Tinguely <tinguely@sgi.com>

xfs: rename log structure to xlog

Rename the XFS log structure to xlog to help crash distinquish it from the
other logs in Linux.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 32972383 07-Jun-2012 Dave Chinner <dchinner@redhat.com>

xfs: make largest supported offset less shouty

XFS_MAXIOFFSET() is just a simple macro that resolves to
mp->m_maxioffset. It doesn't need to exist, and it just makes the
code unnecessarily loud and shouty.

Make it quiet and easy to read.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# d2c28191 07-Jun-2012 Dave Chinner <dchinner@redhat.com>

xfs: m_maxioffset is redundant

The m_maxioffset field in the struct xfs_mount contains the same
value as the superblock s_maxbytes field. There is no need to carry
two copies of this limit around, so use the VFS superblock version.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 4c2d542f 23-Apr-2012 Dave Chinner <david@fromorbit.com>

xfs: Do background CIL flushes via a workqueue

Doing background CIL flushes adds significant latency to whatever
async transaction that triggers it. To avoid blocking async
transactions on things like waiting for log buffer IO to complete,
move the CIL push off into a workqueue. By moving the push work
into a workqueue, we remove all the latency that the commit adds
from the foreground transaction commit path. This also means that
single threaded workloads won't do the CIL push procssing, leaving
them more CPU to do more async transactions.

To do this, we need to keep track of the sequence number we have
pushed work for. This avoids having many transaction commits
attempting to schedule work for the same sequence, and ensures that
we only ever have one push (background or forced) in progress at a
time. It also means that we don't need to take the CIL lock in write
mode to check for potential background push races, which reduces
lock contention.

To avoid potential issues with "smart" IO schedulers, don't use the
workqueue for log force triggered flushes. Instead, do them directly
so that the log IO is done directly by the process issuing the log
force and so doesn't get stuck on IO elevator queue idling
incorrectly delaying the log IO from the workqueue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 211e4d43 22-Apr-2012 Christoph Hellwig <hch@infradead.org>

xfs: implement freezing by emptying the AIL

Now that we write back all metadata either synchronously or through
the AIL we can simply implement metadata freezing in terms of
emptying the AIL.

The implementation for this is fairly simply and straight-forward:
A new routine is added that asks the xfsaild to push the AIL to the
end and waits for it to complete and send a wakeup. The routine will
then loop if the AIL is not actually empty, and continue to do so
until the AIL is compeltely empty.

We keep an inode reclaim pass in the freeze process to avoid having
memory pressure have to reclaim inodes that require dirtying the
filesystem to be reclaimed after the freeze has completed. This
means we can also treat unmount in the exact same way as freeze.

As an upside we can now remove the radix tree based inode writeback
and xfs_unmountfs_writesb.

[ Dave Chinner:
- Cleaned up commit message.
- Added inode reclaim passes back into freeze.
- Cleaned up wakeup mechanism to avoid the use of a new
sleep counter variable. ]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# aa6bf01d 29-Feb-2012 Christoph Hellwig <hch@infradead.org>

xfs: use per-filesystem I/O completion workqueues

The new concurrency managed workqueues are cheap enough that we can create
per-filesystem instead of global workqueues. This allows us to remove the
trylock or defer scheme on the ilock, which is not helpful once we have
outstanding log reservations until finishing a size update.

Also allow the default concurrency on this workqueues so that I/O completions
blocking on the ilock for one inode do not block process for another inode.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 6bd92a23 23-Jan-2012 Chandra Seetharaman <sekharan@us.ibm.com>

Change xfs_sb_from_disk() interface to take a mount pointer

Change xfs_sb_from_disk() interface to take a mount pointer
instead of a superblock pointer.

This is to print mount point specific error messages in future
fixes.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>


# 93b8a585 06-Dec-2011 Christoph Hellwig <hch@infradead.org>

xfs: remove the deprecated nodelaylog option

The delaylog mode has been the default for a long time, and the nodelaylog
option has been scheduled for removal in Linux 3.3. Remove it and code
only used by it now that we have opened the 3.3 window.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>


# adab0f67 29-Jun-2011 Chandra Seetharaman <sekharan@us.ibm.com>

xfs: Remove the second parameter to xfs_sb_count()

Remove the second parameter to xfs_sb_count() since all callers of
the function set them.

Also, fix the header comment regarding it being called periodically.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Alex Elder <aelder@sgi.com>


# e84661aa 20-May-2011 Christoph Hellwig <hch@infradead.org>

xfs: add online discard support

Now that we have reliably tracking of deleted extents in a
transaction we can easily implement "online" discard support
which calls blkdev_issue_discard once a transaction commits.

The actual discard is a two stage operation as we first have
to mark the busy extent as not available for reuse before we
can start the actual discard. Note that we don't bother
supporting discard for the non-delaylog mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# a7b339f1 07-Apr-2011 Dave Chinner <dchinner@redhat.com>

xfs: introduce background inode reclaim work

Background inode reclaim needs to run more frequently that the XFS
syncd work is run as 30s is too long between optimal reclaim runs.
Add a new periodic work item to the xfs syncd workqueue to run a
fast, non-blocking inode reclaim scan.

Background inode reclaim is kicked by the act of marking inodes for
reclaim. When an AG is first marked as having reclaimable inodes,
the background reclaim work is kicked. It will continue to run
periodically untill it detects that there are no more reclaimable
inodes. It will be kicked again when the first inode is queued for
reclaim.

To ensure shrinker based inode reclaim throttles to the inode
cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the
background inode reclaim so that when we are low on memory we are
trying to reclaim inodes as efficiently as possible. This kick shoul
d not be necessary, but it will protect against failures to kick the
background reclaim when inodes are first dirtied.

To provide the rate throttling, make the shrinker pass do
synchronous inode reclaim so that it blocks on inodes under IO. This
means that the shrinker will reclaim inodes rather than just
skipping over them, but it does not adversely affect the rate of
reclaim because most dirty inodes are already under IO due to the
background reclaim work the shrinker kicked.

These two modifications solve one of the two OOM killer invocations
Chris Mason reported recently when running a stress testing script.
The particular workload trigger for the OOM killer invocation is
where there are more threads than CPUs all unlinking files in an
extremely memory constrained environment. Unlike other solutions,
this one does not have a performance impact on performance when
memory is not constrained or the number of concurrent threads
operating is <= to the number of CPUs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>


# 89e4cb55 07-Apr-2011 Dave Chinner <dchinner@redhat.com>

xfs: convert ENOSPC inode flushing to use new syncd workqueue

On of the problems with the current inode flush at ENOSPC is that we
queue a flush per ENOSPC event, regardless of how many are already
queued. Thi can result in hundreds of queued flushes, most of
which simply burn CPU scanned and do no real work. This simply slows
down allocation at ENOSPC.

We really only need one active flush at a time, and we can easily
implement that via the new xfs_syncd_wq. All we need to do is queue
a flush if one is not already active, then block waiting for the
currently active flush to complete. The result is that we only ever
have a single ENOSPC inode flush active at a time and this greatly
reduces the overhead of ENOSPC processing.

On my 2p test machine, this results in tests exercising ENOSPC
conditions running significantly faster - 042 halves execution time,
083 drops from 60s to 5s, etc - while not introducing test
regressions.

This allows us to remove the old xfssyncd threads and infrastructure
as they are no longer used.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>


# c6d09b66 07-Apr-2011 Dave Chinner <dchinner@redhat.com>

xfs: introduce a xfssyncd workqueue

All of the work xfssyncd does is background functionality. There is
no need for a thread per filesystem to do this work - it can al be
managed by a global workqueue now they manage concurrency
effectively.

Introduce a new gglobal xfssyncd workqueue, and convert the periodic
work to use this new functionality. To do this, use a delayed work
construct to schedule the next running of the periodic sync work
for the filesystem. When the sync work is complete, queue a new
delayed work for the next running of the sync work.

For laptop mode, we wait on completion for the sync works, so ensure
that the sync work queuing interface can flush and wait for work to
complete to enable the work queue infrastructure to replace the
current sequence number and wakeup that is used.

Because the sync work does non-trivial amounts of work, mark the
new work queue as CPU intensive.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>


# 055388a3 03-Jan-2011 Dave Chinner <dchinner@redhat.com>

xfs: dynamic speculative EOF preallocation

Currently the size of the speculative preallocation during delayed
allocation is fixed by either the allocsize mount option of a
default size. We are seeing a lot of cases where we need to
recommend using the allocsize mount option to prevent fragmentation
when buffered writes land in the same AG.

Rather than using a fixed preallocation size by default (up to 64k),
make it dynamic by basing it on the current inode size. That way the
EOF preallocation will increase as the file size increases. Hence
for streaming writes we are much more likely to get large
preallocations exactly when we need it to reduce fragementation.

For default settings, the size of the initial extents is determined
by the number of parallel writers and the amount of memory in the
machine. For 4GB RAM and 4 concurrent 32GB file writes:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..1048575]: 1048672..2097247 0 (1048672..2097247) 1048576
1: [1048576..2097151]: 5242976..6291551 0 (5242976..6291551) 1048576
2: [2097152..4194303]: 12583008..14680159 0 (12583008..14680159) 2097152
3: [4194304..8388607]: 25165920..29360223 0 (25165920..29360223) 4194304
4: [8388608..16777215]: 58720352..67108959 0 (58720352..67108959) 8388608
5: [16777216..33554423]: 117440584..134217791 0 (117440584..134217791) 16777208
6: [33554424..50331511]: 184549056..201326143 0 (184549056..201326143) 16777088
7: [50331512..67108599]: 251657408..268434495 0 (251657408..268434495) 16777088

and for 16 concurrent 16GB file writes:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..262143]: 2490472..2752615 0 (2490472..2752615) 262144
1: [262144..524287]: 6291560..6553703 0 (6291560..6553703) 262144
2: [524288..1048575]: 13631592..14155879 0 (13631592..14155879) 524288
3: [1048576..2097151]: 30408808..31457383 0 (30408808..31457383) 1048576
4: [2097152..4194303]: 52428904..54526055 0 (52428904..54526055) 2097152
5: [4194304..8388607]: 104857704..109052007 0 (104857704..109052007) 4194304
6: [8388608..16777215]: 209715304..218103911 0 (209715304..218103911) 8388608
7: [16777216..33554423]: 452984848..469762055 0 (452984848..469762055) 16777208

Because it is hard to take back specualtive preallocation, cases
where there are large slow growing log files on a nearly full
filesystem may cause premature ENOSPC. Hence as the filesystem nears
full, the maximum dynamic prealloc size іs reduced according to this
table (based on 4k block size):

freespace max prealloc size
>5% full extent (8GB)
4-5% 2GB (8GB >> 2)
3-4% 1GB (8GB >> 3)
2-3% 512MB (8GB >> 4)
1-2% 256MB (8GB >> 5)
<1% 128MB (8GB >> 6)

This should reduce the amount of space held in speculative
preallocation for such cases.

The allocsize mount option turns off the dynamic behaviour and fixes
the prealloc size to whatever the mount option specifies. i.e. the
behaviour is unchanged.

Signed-off-by: Dave Chinner <dchinner@redhat.com>


# 6c77b0ea 06-Oct-2010 Christoph Hellwig <hch@infradead.org>

xfs: remove xfs_cred.h

We're not actually passing around credentials inside XFS for a while
now, so remove all xfs_cred.h with it's cred_t typedef and all
instances of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 96540c78 29-Sep-2010 Christoph Hellwig <hch@infradead.org>

xfs: do not use xfs_mod_incore_sb for per-cpu counters

Export xfs_icsb_modify_counters and always use it for modifying
the per-cpu counters. Remove support for per-cpu counters from
xfs_mod_incore_sb to simplify it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 61ba35de 29-Sep-2010 Christoph Hellwig <hch@infradead.org>

xfs: remove XFS_MOUNT_NO_PERCPU_SB

Fail the mount if we can't allocate memory for the per-CPU counters.
This is consistent with how we handle everything else in the mount
path and makes the superblock counter modification a lot simpler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 65d0f205 24-Sep-2010 Dave Chinner <dchinner@redhat.com>

xfs: split inode AG walking into separate code for reclaim

The reclaim walk requires different locking and has a slightly
different walk algorithm, so separate it out so that it can be
optimised separately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>


# a64afb05 20-Jul-2010 Christoph Hellwig <hch@infradead.org>

xfs: remove obsolete osyncisosync mount option

Since Linux 2.6.33 the kernel has support for real O_SYNC, which made
the osyncisosync option a no-op. Warn the users about this and remove
the mount flag for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 288699fe 23-Jun-2010 Christoph Hellwig <hch@infradead.org>

xfs: drop dmapi hooks

Dmapi support was never merged upstream, but we still have a lot of hooks
bloating XFS for it, all over the fast pathes of the filesystem.

This patch drops over 700 lines of dmapi overhead. If we'll ever get HSM
support in mainline at least the namespace events can be done much saner
in the VFS instead of the individual filesystem, so it's not like this
is much help for future work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 70e60ce7 19-Jul-2010 Dave Chinner <dchinner@redhat.com>

xfs: convert inode shrinker to per-filesystem contexts

Now the shrinker passes us a context, wire up a shrinker context per
filesystem. This allows us to remove the global mount list and the
locking problems that introduced. It also means that a shrinker call
does not need to traverse clean filesystems before finding a
filesystem with reclaimable inodes. This significantly reduces
scanning overhead when lots of filesystems are present.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 71e330b5 20-May-2010 Dave Chinner <dchinner@redhat.com>

xfs: Introduce delayed logging core code

The delayed logging code only changes in-memory structures and as
such can be enabled and disabled with a mount option. Add the mount
option and emit a warning that this is an experimental feature that
should not be used in production yet.

We also need infrastructure to track committed items that have not
yet been written to the log. This is what the Committed Item List
(CIL) is for.

The log item also needs to be extended to track the current log
vector, the associated memory buffer and it's location in the Commit
Item List. Extend the log item and log vector structures to enable
this tracking.

To maintain the current log format for transactions with delayed
logging, we need to introduce a checkpoint transaction and a context
for tracking each checkpoint from initiation to transaction
completion. This includes adding a log ticket for tracking space
log required/used by the context checkpoint.

To track all the changes we need an io vector array per log item,
rather than a single array for the entire transaction. Using the new
log vector structure for this requires two passes - the first to
allocate the log vector structures and chain them together, and the
second to fill them out. This log vector chain can then be passed
to the CIL for formatting, pinning and insertion into the CIL.

Formatting of the log vector chain is relatively simple - it's just
a loop over the iovecs on each log vector, but it is made slightly
more complex because we re-write the iovec after the copy to point
back at the memory buffer we just copied into.

This code also needs to pin log items. If the log item is not
already tracked in this checkpoint context, then it needs to be
pinned. Otherwise it is already pinned and we don't need to pin it
again.

The only other complexity is calculating the amount of new log space
the formatting has consumed. This needs to be accounted to the
transaction in progress, and the accounting is made more complex
becase we need also to steal space from it for log metadata in the
checkpoint transaction. Calculate all this at insert time and update
all the tickets, counters, etc correctly.

Once we've formatted all the log items in the transaction, attach
the busy extents to the checkpoint context so the busy extents live
until checkpoint completion and can be processed at that point in
time. Transactions can then be freed at this point in time.

Now we need to issue checkpoints - we are tracking the amount of log space
used by the items in the CIL, so we can trigger background checkpoints when the
space usage gets to a certain threshold. Otherwise, checkpoints need ot be
triggered when a log synchronisation point is reached - a log force event.

Because the log write code already handles chained log vectors, writing the
transaction is trivial, too. Construct a transaction header, add it
to the head of the chain and write it into the log, then issue a
commit record write. Then we can release the checkpoint log ticket
and attach the context to the log buffer so it can be called during
Io completion to complete the checkpoint.

We also need to allow for synchronising multiple in-flight
checkpoints. This is needed for two things - the first is to ensure
that checkpoint commit records appear in the log in the correct
sequence order (so they are replayed in the correct order). The
second is so that xfs_log_force_lsn() operates correctly and only
flushes and/or waits for the specific sequence it was provided with.

To do this we need a wait variable and a list tracking the
checkpoint commits in progress. We can walk this list and wait for
the checkpoints to change state or complete easily, an this provides
the necessary synchronisation for correct operation in both cases.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 9bf729c0 28-Apr-2010 Dave Chinner <dchinner@redhat.com>

xfs: add a shrinker to background inode reclaim

On low memory boxes or those with highmem, kernel can OOM before the
background reclaims inodes via xfssyncd. Add a shrinker to run inode
reclaim so that it inode reclaim is expedited when memory is low.

This is more complex than it needs to be because the VM folk don't
want a context added to the shrinker infrastructure. Hence we need
to add a global list of XFS mount structures so the shrinker can
traverse them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# d7658d48 17-Feb-2010 Christoph Hellwig <hch@infradead.org>

xfs: kill xfs_lrw.h

Move the two declarations to better fitting headers now that
xfs_lrw.c is gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 003cb608 01-Feb-2010 Tejun Heo <tj@kernel.org>

percpu: add __percpu sparse annotations to fs

Add __percpu sparse annotations to fs.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>


# d5db0f97 05-Feb-2010 Eric Sandeen <sandeen@sandeen.net>

xfs: more reserved blocks fixups

This mangles the reserved blocks counts a little more.

1) add a helper function for the default reserved count
2) add helper functions to save/restore counts on ro/rw
3) save/restore reserved blocks on freeze/thaw
4) disallow changing reserved count while readonly

V2: changed field name to match Dave's changes

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>


# cbe132a8 25-Jan-2010 Dave Chinner <david@fromorbit.com>

xfs: don't hold onto reserved blocks on remount,ro

If we hold onto reserved blocks when doing a remount,ro we end
up writing the blocks used count to disk that includes the reserved
blocks. Reserved blocks are not actually used, so this results in
the values in the superblock being incorrect.

Hence if we run xfs_check or xfs_repair -n while the filesystem is
mounted remount,ro we end up with an inconsistent filesystem being
reported. Also, running xfs_copy on the remount,ro filesystem will
result in an inconsistent image being generated.

To fix this, unreserve the blocks when doing the remount,ro, and
reserved them again on remount,rw. This way a remount,ro filesystem
will appear consistent on disk to all utilities.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 046ea753 19-Jan-2010 Dave Chinner <david@fromorbit.com>

xfs: convert DM ops to use unsigned char names

dmops uses a signed char for it's namespace event. To be consistent
with the rest of the code, convert them to unsigned char for the
namespace string.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 0fa800fb 11-Jan-2010 Dave Chinner <david@fromorbit.com>

xfs: Add trace points for per-ag refcount debugging.

Uninline xfs_perag_{get,put} so that tracepoints can be inserted
into them to speed debugging of reference count problems.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# aed3bb90 11-Jan-2010 Dave Chinner <david@fromorbit.com>

xfs: Reference count per-ag structures

Reference count the per-ag structures to ensure that we keep get/put
pairs balanced. Assert that the reference counts are zero at unmount
time to catch leaks. In future, reference counts will enable us to
safely remove perag structures by allowing us to detect when they
are no longer in use.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 1c1c6ebc 11-Jan-2010 Dave Chinner <david@fromorbit.com>

xfs: Replace per-ag array with a radix tree

The use of an array for the per-ag structures requires reallocation
of the array when growing the filesystem. This requires locking
access to the array to avoid use after free situations, and the
locking is difficult to get right. To avoid needing to reallocate an
array, change the per-ag structures to an allocated object per ag
and index them using a tree structure.

The AGs are always densely indexed (hence the use of an array), but
the number supported is 2^32 and lookups tend to be random and hence
indexing needs to scale. A simple choice is a radix tree - it works
well with this sort of index. This change also removes another
large contiguous allocation from the mount/growfs path in XFS.

The growing process now needs to change to only initialise the new
AGs required for the extra space, and as such only needs to
exclusively lock the tree for inserts. The rest of the code only
needs to lock the tree while doing lookups, and hence this will
remove all the deadlocks that currently occur on the m_perag_lock as
it is now an innermost lock. The lock is also changed to a spinlock
from a read/write lock as the hold time is now extremely short.

To complete the picture, the per-ag structures will need to be
reference counted to ensure that we don't free/modify them while
they are still in use. This will be done in subsequent patch.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 5017e97d 11-Jan-2010 Dave Chinner <david@fromorbit.com>

xfs: rename xfs_get_perag

xfs_get_perag is really getting the perag that an inode belongs to
based on it's inode number. Convert the use of this function to just
get the perag from a provided ag number. Use this new function to
obtain the per-ag structure when traversing the per AG inode trees
for sync and reclaim.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>


# 30ac0683 14-Nov-2009 Christoph Hellwig <hch@infradead.org>

xfs: cleanup dmapi macros in the umount path

Stop the flag saving as we never mangle those in the unmount path, and
hide all the weird arguents to the dmapi code inside the
XFS_SEND_PREUNMOUNT / XFS_SEND_UNMOUNT macros.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>


# b8f82a4a 14-Nov-2009 Christoph Hellwig <hch@infradead.org>

xfs: kill the STATIC_INLINE macro

Remove our own STATIC_INLINE macro. For small function inside
implementation files just use STATIC and let gcc inline it, and for
those in headers do the normal static inline - they are all small
enough to be inlined for debug builds, too.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>


# d96f8f89 01-Jul-2009 Eric Sandeen <sandeen@sandeen.net>

xfs: add more statics & drop some unused functions

A lot more functions could be made static, but they need
forward declarations; this does some easy ones, and also
found a few unused functions in the process.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>


# 370f0482 01-Jul-2009 Eric Sandeen <sandeen@sandeen.net>

xfs: add more statics & drop some unused functions

A lot more functions could be made static, but they need
forward declarations; this does some easy ones, and also
found a few unused functions in the process.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>


# 7d095257 08-Jun-2009 Christoph Hellwig <hch@lst.de>

xfs: kill xfs_qmops

Kill the quota ops function vector and replace it with direct calls or
stubs in the CONFIG_XFS_QUOTA=n case.

Make sure we check XFS_IS_QUOTA_RUNNING in the right spots. We can remove
the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set
otherwise.

This brings us back closer to the way this code worked in IRIX and earlier
Linux versions, but we keep a lot of the more useful factoring of common
code.

Eventually we should also kill xfs_qm_bhv.c, but that's left for a later
patch.

Reduces the size of the source code by about 250 lines and the size of
XFS module by about 1.5 kilobytes with quotas enabled:

text data bss dec hex filename
615957 2960 3848 622765 980ad fs/xfs/xfs.o
617231 3152 3848 624231 98667 fs/xfs/xfs.o.old

Fallout:

- xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects
the inode locked and xfs_qm_dqattach which does the locking around it,
thus removing XFS_QMOPT_ILOCKED.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>


# a8d770d9 06-Apr-2009 Dave Chinner <david@fromorbit.com>

xfs: use xfs_sync_inodes() for device flushing

Currently xfs_device_flush calls sync_blockdev() which is
a no-op for XFS as all it's metadata is held in a different
address to the one sync_blockdev() works on.

Call xfs_sync_inodes() instead to flush all the delayed
allocation blocks out. To do this as efficiently as possible,
do it via two passes - one to do an async flush of all the
dirty blocks and a second to wait for all the IO to complete.
This requires some modification to the xfs-sync_inodes_ag()
flush code to do efficiently.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 1a5902c5 29-Mar-2009 Christoph Hellwig <hch@lst.de>

xfs: remove m_attroffset

With the upcoming v3 inodes the default attroffset needs to be calculated
for each specific inode, so we can't cache it in the superblock anymore.

Also replace the assert for wrong inode sizes with a proper error check
also included in non-debug builds. Note that the ENOSYS return for
that might seem odd, but that error is returned by xfs_mount_validate_sb
for all theoretically valid but not supported filesystem geometries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>


# 9da096fd 29-Mar-2009 Malcolm Parsons <malcolm.parsons@gmail.com>

xfs: fix various typos

Signed-off-by: Malcolm Parsons <malcolm.parsons@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 6447c362 29-Mar-2009 Christoph Hellwig <hch@lst.de>

xfs: remove m_litino

With the upcoming v3 inodes the inode data/attr area size needs to be
calculated for each specific inode, so we can't cache it in the superblock
anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>


# a19d9f88 29-Mar-2009 Christoph Hellwig <hch@lst.de>

xfs: kill ino64 mount option

The ino64 mount option adds a fixed offset to 32bit inode numbers
to bring them into the 64bit range. There's no need for this kind
of debug tool given that it's easy to produce real 64bit inode numbers
for testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>


# fcafb71b 09-Feb-2009 Christoph Hellwig <hch@lst.de>

xfs: get rid of indirections in the quotaops implementation

Currently we call from the nicely abstracted linux quotaops into a ugly
multiplexer just to split the calls out at the same boundary again.
Rewrite the quota ops handling to remove that obfucation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>


# 0d87e656 09-Feb-2009 Christoph Hellwig <hch@lst.de>

xfs: remove superflous inobt macros

xfs_ialloc_btree.h has a a cuple of macros that only obsfucate the code
but don't provide any abstraction benefits. This patches removes those
and cleans up the reamaining defintions up a little.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>


# c52e9fd8 04-Feb-2009 Christoph Hellwig <hch@lst.de>

xfs: remove unused XFS_MOUNT_ILOCK/XFS_MOUNT_IUNLOCK

These aren't only unused but also reference a lock that doesn't exist anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>


# b6e32227 14-Jan-2009 Eric Sandeen <sandeen@sandeen.net>

[XFS] Remove the rest of the macro-to-function indirections.

Remove the last of the macros-defined-to-static-functions.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 49739140 18-Jan-2009 Christoph Hellwig <hch@lst.de>

xfs: fix bad_features2 fixups for the root filesystem

Currently the bad_features2 fixup and the alignment updates in the superblock
are skipped if we mount a filesystem read-only. But for the root filesystem
the typical case is to mount read-only first and only later remount writeable
so we'll never perform this update at all. It's not a big problem but means
the logs of people needing the fixup get spammed at every boot because they
never happen on disk.

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>


# 7884bc86 18-Jan-2009 Christoph Hellwig <hch@lst.de>

xfs: fix bad_features2 fixups for the root filesystem

Currently the bad_features2 fixup and the alignment updates in the superblock
are skipped if we mount a filesystem read-only. But for the root filesystem
the typical case is to mount read-only first and only later remount writeable
so we'll never perform this update at all. It's not a big problem but means
the logs of people needing the fixup get spammed at every boot because they
never happen on disk.

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>


# 9d87c319 14-Jan-2009 Eric Sandeen <sandeen@sandeen.net>

[XFS] Remove the rest of the macro-to-function indirections.

Remove the last of the macros-defined-to-static-functions.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 6d73cf13 09-Dec-2008 Christoph Hellwig <hch@infradead.org>

[XFS] resync headers with libxfs

- xfs_sb.h add the XFS_SB_VERSION2_PARENTBIT features2 that has been
around in userspace for some time
- xfs_inode.h: move a few things out of __KERNEL__ that are needed by
userspace
- xfs_mount.h: only include xfs_sync.h under __KERNEL__
- xfs_inode.c: minor whitespace fixup. I accidentaly changes this when
importing this file for use by userspace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# f95099ba 02-Dec-2008 Christoph Hellwig <hch@lst.de>

kill xfs_unmount_flush

There's almost nothing left in this function, instead remove the IRELE
on the real times inodes and the call to XFS_QM_UNMOUNT into xfs_unmountfs.

For the regular unmount case that means it now also happenes after dmapi
notification, but otherwise there is no difference in behaviour.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>


# e57481dc 02-Dec-2008 Christoph Hellwig <hch@lst.de>

no explicit xfs_iflush for special inodes during unmount

Currently we explicitly call xfs_iflush on the quota, real-time and root
inodes from xfs_unmount_flush. But we just called xfs_sync_inodes with
SYNC_ATTR and do an XFS_bflush aka xfs_flush_buftarg to make sure all inodes
are on disk already, so there is no need for these special cases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>


# b56757be 02-Dec-2008 Christoph Hellwig <hch@lst.de>

remove leftovers of shared read-only support

We never supported shared read-only filesystems, so remove the dead
code left over from IRIX for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>


# e88f11ab 02-Dec-2008 Christoph Hellwig <hch@lst.de>

remove unused m_inode_quiesce member from struct xfs_mount

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>


# 5cafdeb2 02-Dec-2008 Christoph Hellwig <hch@lst.de>

cleanup the inode reclaim path

Merge xfs_iextract and xfs_idestroy into xfs_ireclaim as they are never
called individually. Also rewrite most comments in this area as they
were severly out of date.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>


# 2b5decd0 27-Nov-2008 Christoph Hellwig <hch@lst.de>

[XFS] remove xfs_vfs.h

The only thing left are the forced shutdown flags and freeze macros which
fit into xfs_mount.h much better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>


# 00dd4029 27-Nov-2008 Christoph Hellwig <hch@lst.de>

[XFS] remove bhv_statvfs_t typedef

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>


# 9d565ffa 30-Oct-2008 Christoph Hellwig <hch@infradead.org>

[XFS] kill struct xfs_mount_args

No need to parse the mount option into a structure before applying it to
struct xfs_mount.

The content of xfs_start_flags gets merged into xfs_parseargs. Calls
inbetween don't care and can use mount members instead of the args struct.

This patch uncovered that the mount option for shared filesystems wasn't
ever exposed on Linux. The code to handle it is #if 0'ed in this patch
pending a decision on this feature. I'll send a writeup about it to the
list soon.

SGI-PV: 987246

SGI-Modid: xfs-linux-melb:xfs-kern:32371a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# c7e8f268 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] Move the AIL lock into the struct xfs_ail

Bring the ail lock inside the struct xfs_ail. This means the AIL can be
entirely manipulated via the struct xfs_ail rather than needing both the
struct xfs_mount and the struct xfs_ail.

SGI-PV: 988143

SGI-Modid: xfs-linux-melb:xfs-kern:32350a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# 82fa9012 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] Allocate the struct xfs_ail

Rather than embedding the struct xfs_ail in the struct xfs_mount, allocate
it during AIL initialisation. Add a back pointer to the struct xfs_ail so
that we can pass around the xfs_ail and still be able to access the
xfs_mount if need be. This is th first step involved in isolating the AIL
implementation from the surrounding filesystem code.

SGI-PV: 988143

SGI-Modid: xfs-linux-melb:xfs-kern:32346a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# 11654513 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] kill deleted inodes list

Now that the deleted inodes list is unused, kill it. This also removes the
i_reclaim list head from the xfs_inode, shrinking it by two pointers.

SGI-PV: 988142

SGI-Modid: xfs-linux-melb:xfs-kern:32334a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# 2030b5ab 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] use xfs_sync_inodes rather than xfs_syncsub

Kill the unused arg in xfs_syncsub() and xfs_sync_inodes(). For callers of
xfs_syncsub() that only want to flush inodes, replace xfs_syncsub() with
direct calls to xfs_sync_inodes() as that is all that is being done with
the specific flags being passed in.

SGI-PV: 988140

SGI-Modid: xfs-linux-melb:xfs-kern:32305a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# 6c7699c0 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] remove the mount inode list

Now we've removed all users of the mount inode list, we can kill it. This
reduces the size of the xfs_inode by 2 pointers.

SGI-PV: 988139

SGI-Modid: xfs-linux-melb:xfs-kern:32293a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# 60197e8d 30-Oct-2008 Christoph Hellwig <hch@infradead.org>

[XFS] Cleanup maxrecs calculation.

Clean up the way the maximum and minimum records for the btree blocks are
calculated. For the alloc and inobt btrees all the values are
pre-calculated in xfs_mount_common, and we switch the current loop around
the ugly generic macros that use cpp token pasting to generate type names
to two small helpers in normal C code. For the bmbt and bmdr trees these
helpers also exist, but can be called during runtime, too. Here we also
kill various macros dealing with them and inline the logic into the
get_minrecs / get_maxrecs / get_dmaxrecs methods in xfs_bmap_btree.c.

Note that all these new helpers take an xfs_mount * argument which will be
needed to determine the size of a btree block once we add support for
extended btree blocks with CRCs and other RAS information.

SGI-PV: 988146

SGI-Modid: xfs-linux-melb:xfs-kern:32292a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# a167b17e 30-Oct-2008 David Chinner <david@fromorbit.com>

[XFS] move xfssyncd code to xfs_sync.c

Move all the xfssyncd code to the new xfs_sync.c file. This places it
closer to the actual code that it interacts with, rather than just being
associated with high level VFS code.

SGI-PV: 988139

SGI-Modid: xfs-linux-melb:xfs-kern:32283a

Signed-off-by: David Chinner <david@fromorbit.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# 847fff5c 30-Oct-2008 Barry Naujok <bnaujok@sgi.com>

[XFS] Sync up kernel and user-space headers

SGI-PV: 986558

SGI-Modid: xfs-linux-melb:xfs-kern:32231a

Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 46039928 29-Oct-2008 Barry Naujok <bnaujok@sgi.com>

[XFS] Remove final remnants of dirv1 macros and other stuff

SGI-PV: 981498

SGI-Modid: xfs-linux-melb:xfs-kern:32002a

Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 41b5c2e7 13-Aug-2008 Christoph Hellwig <hch@infradead.org>

[XFS] xfs_unmountfs should return void

xfs_unmounts can't and shouldn't return errors so declare it as returning
void.

SGI-PV: 981498

SGI-Modid: xfs-linux-melb:xfs-kern:31833a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 4249023a 13-Aug-2008 Christoph Hellwig <hch@infradead.org>

[XFS] cleanup xfs_mountfs

Remove all the useless flags and code keyed off it in xfs_mountfs.

SGI-PV: 981498

SGI-Modid: xfs-linux-melb:xfs-kern:31831a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 9f8868ff 18-Jul-2008 Christoph Hellwig <hch@infradead.org>

[XFS] streamline init/exit path

Currently the xfs module init/exit code is a mess. It's farmed out over a
lot of function with very little error checking. This patch makes sure we
propagate all initialization failures properly and clean up after them.
Various runtime initializations are replaced with compile-time
initializations where possible to make this easier. The exit path is
similarly consolidated.

There's now split out function to create/destroy the kmem zones and
alloc/free the trace buffers. I've also changed the ktrace allocations to
KM_MAYFAIL and handled errors resulting from that.

And yes, we really should replace the XFS_*_TRACE ifdefs with a single
XFS_TRACE..

SGI-PV: 976035

SGI-Modid: xfs-linux-melb:xfs-kern:31354a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 5163f95a 21-May-2008 Barry Naujok <bnaujok@sgi.com>

[XFS] Name operation vector for hash and compare

Adds two pieces of functionality for the basis of case-insensitive support
in XFS:

1. A comparison result enumerated type: xfs_dacmp. It represents an

exact match, case-insensitive match or no match at all. This patch

only implements different and exact results.

2. xfs_nameops vector for specifying how to perform the hash generation

of filenames and comparision methods. In this patch the hash vector

points to the existing xfs_da_hashname function and the comparison

method does a length compare, and if the same, does a memcmp and

return the xfs_dacmp result.

All filename functions that use the hash (create, lookup remove, rename,
etc) now use the xfs_nameops.hashname function and all directory lookup
functions also use the xfs_nameops.compname function.

The lookup functions also handle case-insensitive results even though the
default comparison function cannot return that. And important aspect of
the lookup functions is that an exact match always has precedence over a
case-insensitive. So while a case-insensitive match is found, we have to
keep looking just in case there is an exact match. In the meantime, the
info for the first case-insensitive match is retained if no exact match is
found.

SGI-PV: 981519
SGI-Modid: xfs-linux-melb:xfs-kern:31205a

Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>


# c962fb79 19-May-2008 Christoph Hellwig <hch@infradead.org>

[XFS] kill xfs_mount_init

xfs_mount_init is inlined into xfs_fs_fill_super and allocation switched
to kzalloc. Plug a leak of the mount structure for most early mount
failures. Move xfs_icsb_init_counters to as late as possible in the mount
path and make sure to undo it so that no stale hotplug cpu notifiers are
left around on mount failures.

SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31196a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 19f354d4 19-May-2008 Christoph Hellwig <hch@infradead.org>

[XFS] sort out opening and closing of the block devices

Currently closing the rt/log block device is done in the wrong spot, and
far too early. So revampt it:

- xfs_blkdev_put moved out of xfs_free_buftarg into the caller so that

it is done after tearing down the buftarg completely.

- call to xfs_unmountfs_close moved from xfs_mountfs into caller so

that it's done after tearing down the filesystem completely.

- xfs_unmountfs_close is renamed to xfs_close_devices and made static

in xfs_super.c

- opening of the block devices is split into a helper xfs_open_devices

that is symetric in use to xfs_close_devices

- xfs_unmountfs can now lose struct cred

- error handling around device opening sanitized in xfs_fs_fill_super

SGI-PV: 981951
SGI-Modid: xfs-linux-melb:xfs-kern:31193a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 7c12f296 30-Apr-2008 Tim Shimmin <tes@sgi.com>

[XFS] Fix up noattr2 so that it will properly update the versionnum and
features2 fields.

Previously, mounting with noattr2 failed to achieve anything because
although it cleared the attr2 mount flag, it would set it again as soon as
it processed the superblock fields. The fix now has an explicit noattr2
flag and uses it later to fix up the versionnum and features2 fields.

SGI-PV: 980021
SGI-Modid: xfs-linux-melb:xfs-kern:31003a

Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 7155054c 28-Apr-2008 Eric Sandeen <sandeen@sandeen.net>

[XFS] fix non-smp xfs build

xfs_reserve_blocks() calls xfs_icsb_sync_counters_locked(), which is not
defined if !CONFIG_SMP/!HAVE_PERCPU_SB

SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30991a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# ce46193b 22-Apr-2008 Christoph Hellwig <hch@infradead.org>

[XFS] kill XFS_ICSB_SB_LOCKED

With the last two patches XFS_ICSB_SB_LOCKED is never checked and only
superflously passed to xfs_icsb_count, so kill it.

SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30920a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# d4d90b57 22-Apr-2008 Christoph Hellwig <hch@infradead.org>

[XFS] Add xfs_icsb_sync_counters_locked for when m_sb_lock already held

Add a new xfs_icsb_sync_counters_locked for the case where m_sb_lock
is already taken and add a flags argument to xfs_icsb_sync_counters so
that xfs_icsb_sync_counters_flags is not needed.

SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30917a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 556b8b16 09-Apr-2008 Barry Naujok <bnaujok@sgi.com>

[XFS] remove bhv_vname_t and xfs_rename code

SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30804a

Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 535f6b37 27-Mar-2008 Josef 'Jeff' Sipek <jeffpc@josefsipek.net>

[XFS] Replace custom AIL linked-list code with struct list_head

Replace the xfs_ail_entry_t with a struct list_head and clean the
surrounding code up. Also fixes a livelock in xfs_trans_first_push_ail()
by terminating the loop at the head of the list correctly.

SGI-PV: 978682
SGI-Modid: xfs-linux-melb:xfs-kern:30636a

Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# bc4ac74a 05-Mar-2008 Christoph Hellwig <hch@infradead.org>

[XFS] cleanup vnode use in dmapi calls

SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30545a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 126468b1 05-Mar-2008 Christoph Hellwig <hch@infradead.org>

[XFS] kill xfs_rwlock/xfs_rwunlock

We can just use xfs_ilock/xfs_iunlock instead and get rid of the ugly
bhv_vrwlock_t.

SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30533a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 1bd960ee 28-Feb-2008 Josef Jeff Sipek <jeffpc@josefsipek.net>

[XFS] If you mount an XFS filesystem with no mount options at all, then
the "ikeep" option is set rather than "noikeep".

This regression was introduced in 970451.

With no mount options specified, xfs_parseargs() does the following:

int ikeep = 0;

args->flags |= XFSMNT_BARRIER;

args->flags2 |= XFSMNT2_COMPAT_IOSIZE;

if (!options)

goto done;

It only sets the above two options by default and before, it also used to
set XFSMNT_IDELETE by default.

If options are specified, then

if (!(args->flags & XFSMNT_DMAPI) && !ikeep)

args->flags |= XFSMNT_IDELETE;

is executed later on which is skipped by the "goto done;" above.

The solution is to invert the logic.

SGI-PV: 977771
SGI-Modid: xfs-linux-melb:xfs-kern:30590a

Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 249a8c11 04-Feb-2008 David Chinner <dgc@sgi.com>

[XFS] Move AIL pushing into it's own thread

When many hundreds to thousands of threads all try to do simultaneous
transactions and the log is in a tail-pushing situation (i.e. full), we
can get multiple threads walking the AIL list and contending on the AIL
lock.

The AIL push is, in effect, a simple I/O dispatch algorithm complicated by
the ordering constraints placed on it by the transaction subsystem. It
really does not need multiple threads to push on it - even when only a
single CPU is pushing the AIL, it can push the I/O out far faster that
pretty much any disk subsystem can handle.

So, to avoid contention problems stemming from multiple list walkers, move
the list walk off into another thread and simply provide a "target" to
push to. When a thread requires a push, it sets the target and wakes the
push thread, then goes to sleep waiting for the required amount of space
to become available in the log.

This mechanism should also be a lot fairer under heavy load as the waiters
will queue in arrival order, rather than queuing in "who completed a push
first" order.

Also, by moving the pushing to a separate thread we can do more
effectively overload detection and prevention as we can keep context from
loop iteration to loop iteration. That is, we can push only part of the
list each loop and not have to loop back to the start of the list every
time we run. This should also help by reducing the number of items we try
to lock and/or push items that we cannot move.

Note that this patch is not intended to solve the inefficiencies in the
AIL structure and the associated issues with extremely large list
contents. That needs to be addresses separately; parallel access would
cause problems to any new structure as well, so I'm only aiming to isolate
the structure from unbounded parallelism here.

SGI-PV: 972759
SGI-Modid: xfs-linux-melb:xfs-kern:30371a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>


# 613d7043 11-Oct-2007 Christoph Hellwig <hch@infradead.org>

[XFS] kill xfs_iocore_t

xfs_iocore_t is a structure embedded in xfs_inode. Except for one field it
just duplicates fields already in xfs_inode, and there is nothing this
abstraction buys us on XFS/Linux. This patch removes it and shrinks source
and binary size of xfs aswell as shrinking the size of xfs_inode by 60/44
bytes in debug/non-debug builds.

SGI-PV: 970852
SGI-Modid: xfs-linux-melb:xfs-kern:29754a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 36e41eeb 11-Oct-2007 Eric Sandeen <sandeen@sandeen.net>

[XFS] Cleanup lock goop.

Switch last couple lock_t's to spinlock_t's. Remove now-unused
spinlock-related macros & types.

SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29748a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 3685c2a1 11-Oct-2007 Eric Sandeen <sandeen@sandeen.net>

[XFS] Unwrap XFS_SB_LOCK.

Un-obfuscate XFS_SB_LOCK, remove XFS_SB_LOCK->mutex_lock->spin_lock
macros, call spin_lock directly, remove extraneous cookie holdover from
old xfs code, and change lock type to spinlock_t.

SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29746a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 287f3dad 11-Oct-2007 Donald Douwsma <donaldd@sgi.com>

[XFS] Unwrap AIL_LOCK

SGI-PV: 970382
SGI-Modid: xfs-linux-melb:xfs-kern:29739a

Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 541d7d3c 11-Oct-2007 Lachlan McIlroy <lachlan@sgi.com>

[XFS] kill unnessecary ioops indirection

Currently there is an indirection called ioops in the XFS data I/O path.
Various functions are called by functions pointers, but there is no
coherence in what this is for, and of course for XFS itself it's entirely
unused. This patch removes it instead and significantly reduces source and
binary size of XFS while making maintaince easier.

SGI-PV: 970841
SGI-Modid: xfs-linux-melb:xfs-kern:29737a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# cc92e7ac 30-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] growlock should be a mutex

m_growlock only needs plain binary mutex semantics, so use a struct mutex
instead of a semaphore for it.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29512a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# b267ce99 30-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] kill struct bhv_vfs

Now that struct bhv_vfs doesn't have any members left we can kill it and
go directly from the super_block to the xfs_mount everywhere.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29509a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 74394496 30-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] move syncing related members from struct bhv_vfs to struct xfs_mount

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29508a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# bd186aa9 30-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] kill the vfs_flags member in struct bhv_vfs

All flags are added to xfs_mount's m_flag instead. Note that the 32bit
inode flag was duplicated in both of them, but only cleared in the mount
when it was not nessecary due to the filesystem beeing small enough. Two
flags are still required here - one to indicate the mount option setting,
and one to indicate if it applies or not.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29507a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 745f6919 30-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] call common xfs vfs-level helpers directly and remove vfs operations

Also remove the now dead behavior code.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29505a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 48c872a9f 30-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] decontaminate vfs operations from behavior details

All vfs ops now take struct xfs_mount pointers and the behaviour related
glue is split out into methods of its own.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29504a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# b09cc771 30-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] remove dependency of the quota module on behaviors

Mount options are now parsed by the main XFS module and rejected if quota
support is not available, and there are some new quota operation for the
quotactl syscall and calls to quote in the mount, unmount and sync
callchains.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29503a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 293688ec 28-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] remove dependency of the dmapi module on behaviors

Mount options are now parsed by the main XFS module and rejected if dmapi
support is not available, and there is a new dm operation to send the
mount event.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29502a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 0a74cd19 28-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] kill struct bhv_vnode

Now that struct bhv_vnode is empty we can just kill it. Retain bhv_vnode_t
as a typedef for struct inode for the time being until all the fallout is
cleaned up.

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29500a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 739bfb2a 28-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] call common xfs vnode-level helpers directly and remove vnode operations

SGI-PV: 969608
SGI-Modid: xfs-linux-melb:xfs-kern:29493a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# da353b0d 27-Aug-2007 David Chinner <dgc@sgi.com>

[XFS] Radix tree based inode caching

One of the perpetual scaling problems XFS has is indexing it's incore
inodes. We currently uses hashes and the default hash sizes chosen can
only ever be a tradeoff between memory consumption and the maximum
realistic size of the cache.

As a result, anyone who has millions of inodes cached on a filesystem
needs to tunes the size of the cache via the ihashsize mount option to
allow decent scalability with inode cache operations.

A further problem is the separate inode cluster hash, whose size is based
on the ihashsize but is smaller, and so under certain conditions (sparse
cluster cache population) this can become a limitation long before the
inode hash is causing issues.

The following patchset removes the inode hash and cluster hash and
replaces them with radix trees to avoid the scalability limitations of the
hashes. It also reduces the size of the inodes by 3 pointers....

SGI-PV: 969561
SGI-Modid: xfs-linux-melb:xfs-kern:29481a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 2bdf7cd0 27-Aug-2007 Christoph Hellwig <hch@infradead.org>

[XFS] superblock endianess annotations

Creates a new xfs_dsb_t that is __be annotated and keeps xfs_sb_t for the
incore one. xfs_xlatesb is renamed to xfs_sb_to_disk and only handles the
incore -> disk conversion. A new helper xfs_sb_from_disk handles the other
direction and doesn't need the slightly hacky table-driven approach
because we only ever read the full sb from disk.

The handling of shared r/o filesystems has been buggy on little endian
system and fixing this required shuffling around of some code in that
area.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29477a

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 40906630 16-Aug-2007 Eric Sandeen <sandeen@sandeen.net>

[XFS] Remove m_nreadaheads

m_nreadaheads in the mount struct is never used; remove it and the various
macros assigned to it. Also remove a couple other unused macros in the
same areas.

Removes one user of xfs_physmem.

SGI-PV: 968563
SGI-Modid: xfs-linux-melb:xfs-kern:29322a

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 2a82b8be 10-Jul-2007 David Chinner <dgc@sgi.com>

[XFS] Concurrent Multi-File Data Streams

In media spaces, video is often stored in a frame-per-file format. When
dealing with uncompressed realtime HD video streams in this format, it is
crucial that files do not get fragmented and that multiple files a placed
contiguously on disk.

When multiple streams are being ingested and played out at the same time,
it is critical that the filesystem does not cross the streams and
interleave them together as this creates seek and readahead cache miss
latency and prevents both ingest and playout from meeting frame rate
targets.

This patch set creates a "stream of files" concept into the allocator to
place all the data from a single stream contiguously on disk so that RAID
array readahead can be used effectively. Each additional stream gets
placed in different allocation groups within the filesystem, thereby
ensuring that we don't cross any streams. When an AG fills up, we select a
new AG for the stream that is not in use.

The core of the functionality is the stream tracking - each inode that we
create in a directory needs to be associated with the directories' stream.
Hence every time we create a file, we look up the directories' stream
object and associate the new file with that object.

Once we have a stream object for a file, we use the AG that the stream
object point to for allocations. If we can't allocate in that AG (e.g. it
is full) we move the entire stream to another AG. Other inodes in the same
stream are moved to the new AG on their next allocation (i.e. lazy
update).

Stream objects are kept in a cache and hold a reference on the inode.
Hence the inode cannot be reclaimed while there is an outstanding stream
reference. This means that on unlink we need to remove the stream
association and we also need to flush all the associations on certain
events that want to reclaim all unreferenced inodes (e.g. filesystem
freeze).

SGI-PV: 964469
SGI-Modid: xfs-linux-melb:xfs-kern:29096a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Vlad Apostolov <vapo@sgi.com>


# 92821e2b 23-May-2007 David Chinner <dgc@sgi.com>

[XFS] Lazy Superblock Counters

When we have a couple of hundred transactions on the fly at once, they all
typically modify the on disk superblock in some way.
create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify
free block counts.

When these counts are modified in a transaction, they must eventually lock
the superblock buffer and apply the mods. The buffer then remains locked
until the transaction is committed into the incore log buffer. The result
of this is that with enough transactions on the fly the incore superblock
buffer becomes a bottleneck.

The result of contention on the incore superblock buffer is that
transaction rates fall - the more pressure that is put on the superblock
buffer, the slower things go.

The key to removing the contention is to not require the superblock fields
in question to be locked. We do that by not marking the superblock dirty
in the transaction. IOWs, we modify the incore superblock but do not
modify the cached superblock buffer. In short, we do not log superblock
modifications to critical fields in the superblock on every transaction.
In fact we only do it just before we write the superblock to disk every
sync period or just before unmount.

This creates an interesting problem - if we don't log or write out the
fields in every transaction, then how do the values get recovered after a
crash? the answer is simple - we keep enough duplicate, logged information
in other structures that we can reconstruct the correct count after log
recovery has been performed.

It is the AGF and AGI structures that contain the duplicate information;
after recovery, we walk every AGI and AGF and sum their individual
counters to get the correct value, and we do a transaction into the log to
correct them. An optimisation of this is that if we have a clean unmount
record, we know the value in the superblock is correct, so we can avoid
the summation walk under normal conditions and so mount/recovery times do
not change under normal operation.

One wrinkle that was discovered during development was that the blocks
used in the freespace btrees are never accounted for in the AGF counters.
This was once a valid optimisation to make; when the filesystem is full,
the free space btrees are empty and consume no space. Hence when it
matters, the "accounting" is correct. But that means the when we do the
AGF summations, we would not have a correct count and xfs_check would
complain. Hence a new counter was added to track the number of blocks used
by the free space btrees. This is an *on-disk format change*.

As a result of this, lazy superblock counters are a mkfs option and at the
moment on linux there is no way to convert an old filesystem. This is
possible - xfs_db can be used to twiddle the right bits and then
xfs_repair will do the format conversion for you. Similarly, you can
convert backwards as well. At some point we'll add functionality to
xfs_admin to do the bit twiddling easily....

SGI-PV: 964999
SGI-Modid: xfs-linux-melb:xfs-kern:28652a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 4cc929ee 14-May-2007 Nathan Scott <nscott@aconex.com>

[XFS] Don't grow filesystems past the size they can index.

When growing a filesystem we don't check to see if the new size overflows
the page cache index range, so we can do silly things like grow a
filesystem page 16TB on a 32bit. Check new filesystem sizes against the
limits the kernel can support.

SGI-PV: 957886
SGI-Modid: xfs-linux-melb:xfs-kern:28563a

Signed-Off-By: Nathan Scott <nscott@aconex.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 5478eead 10-Feb-2007 Lachlan McIlroy <lachlan@sgi.com>

[XFS] Re-initialize the per-cpu superblock counters after recovery.

After filesystem recovery the superblock is re-read to bring in any
changes. If the per-cpu superblock counters are not re-initialized from
the superblock then the next time the per-cpu counters are disabled they
might overwrite the global counter with a bogus value.

SGI-PV: 957348
SGI-Modid: xfs-linux-melb:xfs-kern:27999a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# c97be736 10-Feb-2007 Kevin Jamieson <kjamieson@bycast.com>

[XFS] Fix block reservation changes for non-SMP systems.

SGI-PV: 956323
SGI-Modid: xfs-linux-melb:xfs-kern:27940a

Signed-off-by: Kevin Jamieson <kjamieson@bycast.com>
Signed-off-by: David Chatterton <chatz@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# dbcabad1 10-Feb-2007 David Chinner <dgc@sgi.com>

[XFS] Fix block reservation mechanism.

The block reservation mechanism has been broken since the per-cpu
superblock counters were introduced. Make the block reservation code work
with the per-cpu counters by syncing the counters, snapshotting the amount
of available space and then doing a modifcation of the counter state
according to the result. Continue in a loop until we either have no space
available or we reserve some space.

SGI-PV: 956323
SGI-Modid: xfs-linux-melb:xfs-kern:27895a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 20f4ebf2 10-Feb-2007 David Chinner <dgc@sgi.com>

[XFS] Make growfs work for amounts greater than 2TB

The free block modification code has a 32bit interface, limiting the size
the filesystem can be grown even on 64 bit machines. On 32 bit machines,
there are other 32bit variables in transaction structures and interfaces
that need to be expanded to allow this to work.

SGI-PV: 959978
SGI-Modid: xfs-linux-melb:xfs-kern:27894a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 1f9b3b64 10-Feb-2007 Lachlan McIlroy <lachlan@sgi.com>

[XFS] remove unused xflags parameter from sync routines

SGI-PV: 959137
SGI-Modid: xfs-linux-melb:xfs-kern:27710a

Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 03135cf7 10-Feb-2007 David Chinner <dgc@sgi.com>

[XFS] Fix UP build breakage due to undefined m_icsb_mutex.

SGI-PV: 952227
SGI-Modid: xfs-linux-melb:xfs-kern:27692a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 20b64285 10-Feb-2007 David Chinner <dgc@sgi.com>

[XFS] Reduction global superblock lock contention near ENOSPC.

The existing per-cpu superblock counter code uses the global superblock
spin lock when we approach ENOSPC for global synchronisation. On larger
machines than this code was originally tested on this can still get
catastrophic spinlock contention due increasing rebalance frequency near
ENOSPC.

By introducing a sleeping lock that is used to serialise balances and
modifications near ENOSPC we prevent contention from needlessly from
wasting the CPU time of potentially hundreds of CPUs.

To reduce the number of balances occuring, we separate the need rebalance
case from the slow allocate case. Now, a counter running dry will trigger
a rebalance during which counters are disabled. Any thread that sees a
disabled counter enters a different path where it waits on the new mutex.
When it gets the new mutex, it checks if the counter is disabled. If the
counter is disabled, then we _know_ that we have to use the global counter
and lock and it is safe to do so immediately. Otherwise, we drop the mutex
and go back to trying the per-cpu counters which we know were re-enabled.

SGI-PV: 952227
SGI-Modid: xfs-linux-melb:xfs-kern:27612a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# 215101c3 27-Sep-2006 Nathan Scott <nathans@sgi.com>

[XFS] Fix kmem_zalloc_greedy warnings on 64 bit platforms.

SGI-PV: 955302
SGI-Modid: xfs-linux-melb:xfs-kern:26907a

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# fe48cae9 27-Sep-2006 Christoph Hellwig <hch@lst.de>

[XFS] remove bhv_lookup, _range version works aswell and has more useful
semantics.

SGI-PV: 954580
SGI-Modid: xfs-linux-melb:xfs-kern:26563a

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>


# f6c2d1fa 19-Jun-2006 Nathan Scott <nathans@sgi.com>

[XFS] Remove version 1 directory code. Never functioned on Linux, just
pure bloat.

SGI-PV: 952969
SGI-Modid: xfs-linux-melb:xfs-kern:26251a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 8285fb58 09-Jun-2006 Nathan Scott <nathans@sgi.com>

[XFS] Resolve a namespace collision on remaining vtypes for FreeBSD
porters.

SGI-PV: 953338
SGI-Modid: xfs-linux-melb:xfs-kern:26108a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 67fcaa73 09-Jun-2006 Nathan Scott <nathans@sgi.com>

[XFS] Resolve a namespace collision on vnode/vnodeops for FreeBSD porters.

SGI-PV: 953338
SGI-Modid: xfs-linux-melb:xfs-kern:26107a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# b83bd138 09-Jun-2006 Nathan Scott <nathans@sgi.com>

[XFS] Resolve a namespace collision on vfs/vfsops for FreeBSD porters.

SGI-PV: 9533338
SGI-Modid: xfs-linux-melb:xfs-kern:26106a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 7d04a335 08-Jun-2006 Nathan Scott <nathans@sgi.com>

[XFS] Shutdown the filesystem if all device paths have gone. Made
shutdown vop flags consistent with sync vop flags declarations too.

SGI-PV: 939911
SGI-Modid: xfs-linux-melb:xfs-kern:26096a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 3e57ecf6 08-Jun-2006 Olaf Weber <olaf@sgi.com>

[XFS] Add parameters to xfs_bmapi() and xfs_bunmapi() to have them report
the range spanned by modifications to the in-core extent map. Add
XFS_BUNMAPI() and XFS_SWAP_EXTENTS() macros that call xfs_bunmapi() and
xfs_swap_extents() via the ioops vector. Change all calls that may modify
the in-core extent map for the data fork to go through the ioops vector.
This allows a cache of extent map data to be kept in sync.

SGI-PV: 947615
SGI-Modid: xfs-linux-melb:xfs-kern:209226a

Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# 764d1f89 30-Mar-2006 Nathan Scott <nathans@sgi.com>

[XFS] Implement the silent parameter to fill_super, previously ignored.

SGI-PV: 951299
SGI-Modid: xfs-linux-melb:xfs-kern:25632a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# c41564b5 28-Mar-2006 Nathan Scott <nathans@sgi.com>

[XFS] We really suck at spulling. Thanks to Chris Pascoe for fixing all
these typos.

SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:25539a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# b8b0f546 13-Mar-2006 Nathan Scott <nathans@sgi.com>

[XFS] Remove a couple of no-longer-used macros/types from XFS.

SGI-PV: 950556
SGI-Modid: xfs-linux-melb:xfs-kern:25377a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 01e1b69c 13-Mar-2006 David Chinner <dgc@sgi.com>

[XFS] using a spinlock per cpu for superblock counter exclusion results in
a preēmpt counter overflow at 256p and above. Change the exclusion
mechanism to use atomic bit operations and busy wait loops to emulate the
spin lock exclusion mechanism but without the preempt count issues.

SGI-PV: 950027
SGI-Modid: xfs-linux-melb:xfs-kern:25338a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# e8234a68 13-Mar-2006 David Chinner <dgc@sgi.com>

[XFS] Add support for hotplug CPUs to the per-CPU superblock counters by
registering a notifier callback that listens to CPU up/down events to
modify the counters appropriately.

SGI-PV: 949726
SGI-Modid: xfs-linux-melb:xfs-kern:25214a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# 8d280b98 13-Mar-2006 David Chinner <dgc@sgi.com>

[XFS] On machines with more than 8 cpus, when running parallel I/O
threads, the incore superblock lock becomes the limiting factor for
buffered write throughput. Make the contended fields in the incore
superblock use per-cpu counters so that there is no global lock to limit
scalability.

SGI-PV: 946630
SGI-Modid: xfs-linux-melb:xfs-kern:25106a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# 9f4cbecd 13-Mar-2006 Nathan Scott <nathans@sgi.com>

[XFS] XFS propagates MS_NOATIME through two levels internally but doesn't
actually use it. Kill this dead code. Signed-off-by: Christoph Hellwig
<hch@lst.de>

SGI-PV: 904196
SGI-Modid: xfs-linux-melb:xfs-kern:25086a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# ce8e922c 10-Jan-2006 Nathan Scott <nathans@sgi.com>

[XFS] Complete the pagebuf -> xfs_buf naming convention transition,
finally.

SGI-PV: 947038
SGI-Modid: xfs-linux-melb:xfs-kern:24866a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 13059ff0 10-Jan-2006 Nathan Scott <nathans@sgi.com>

[XFS] Reverse the sense of COMPAT_ATTR and ATTR2, keeps it simple and
consistent.

SGI-PV: 941645
SGI-Modid: xfs-linux-melb:xfs-kern:202961a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 794ee1ba 09-Jan-2006 Jes Sorensen <jes@trained-monkey.org>

[PATCH] mutex subsystem, semaphore to mutex: XFS

This patch switches XFS over to use the new mutex code directly as
opposed to the previous workaround patch I posted earlier that avoided
the namespace clash by forcing it back to semaphores. This falls in the
'works for me<tm>' category.

Signed-off-by: Jes Sorensen <jes@trained-monkey.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


# c11e2c36 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Rework fid encode/decode wrt 64 bit inums interacting with NFS.

SGI-PV: 937127
SGI-Modid: xfs-linux:xfs-kern:24201a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# e718eeb4 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Rework the final mount options flag bit to make room for more.

SGI-PV: 943866
SGI-Modid: xfs-linux:xfs-kern:24030a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# 7b718769 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Update license/copyright notices to match the prefered SGI
boilerplate.

SGI-PV: 913862
SGI-Modid: xfs-linux:xfs-kern:23903a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# a844f451 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Remove xfs_macros.c, xfs_macros.h, rework headers a whole lot.

SGI-PV: 943122
SGI-Modid: xfs-linux:xfs-kern:23901a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# fc1f8c1c 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Track external log/realtime device names for correct reporting in
/proc/mounts.

SGI-PV: 942984
SGI-Modid: xfs-linux:xfs-kern:23862a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# f74dee42 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Ondisk format extension for extended attributes (attr2). Basically,
the data/attr forks now grow up/down from either end of the literal area,
rather than dividing the literal area into two chunks and growing both
upward. Means we can now make much more efficient use of the attribute
space, incl. fitting DMF attributes inline in 256 byte inodes, and large
jumps in dbench3 performance numbers. It is self enabling, but can be
forced on/off via the attr2/noattr2 mount options.

SGI-PV: 941645
SGI-Modid: xfs-linux:xfs-kern:23837a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# e8c8b3a7 01-Nov-2005 David Chinner <dgc@sgi.com>

[XFS] Introduce two new mount options (nolargeio/largeio) to allow
filesystems to expose the filesystem stripe width in stat(2) rather than
the page cache size. This allows applications requiring high bandwidth to
easily determine the optimum I/O size for the underlying filesystem. The
default is to report the page cache size (i.e. "nolargeio").

SGI-PV: 942818
SGI-Modid: xfs-linux:xfs-kern:23830a

Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# ee34807a 01-Nov-2005 Nathan Scott <nathans@sgi.com>

[XFS] Provide a mechiansm for flushing delalloc before quota reporting.

SGI-PV: 942815
SGI-Modid: xfs-linux:xfs-kern:23829a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# f538d4da 01-Nov-2005 Christoph Hellwig <hch@sgi.com>

[XFS] write barrier support Issue all log sync operations as ordered
writes. In addition flush the disk cache on fsync if the sync cached
operation didn't sync the log to disk (this requires some additional
bookeping in the transaction and log code). If the device doesn't claim to
support barriers, the filesystem has an extern log volume or the trial
superblock write with barriers enabled failed we disable barriers and
print a warning. We should probably fail the mount completely, but that
could lead to nasty boot failures for the root filesystem. Not enabled by
default yet, needs more destructive testing first.

SGI-PV: 912426
SGI-Modid: xfs-linux:xfs-kern:198723a

Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# c8ad20ff 20-Jun-2005 Nathan Scott <nathans@sgi.com>

[XFS] Add support for project quota, based on Dan Knappes earlier work.

SGI-PV: 932952
SGI-Modid: xfs-linux:xfs-kern:22805a

Signed-off-by: Nathan Scott <nathans@sgi.com>


# ba0f32d4 20-Jun-2005 Christoph Hellwig <hch@sgi.com>

[XFS] mark various symbols static Patch from Adrian Bunk

SGI-PV: 936255
SGI-Modid: xfs-linux:xfs-kern:192760a

Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>


# f403b7f4 05-May-2005 Nathan Scott <nathans@sgi.com>

[XFS] Cleanup use of loff_t vs xfs_off_t in the core code.

SGI Modid: xfs-linux:xfs-kern:22378a

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Christoph Hellwig <hch@sgi.com>


# 24e17b5f 05-May-2005 Nathan Scott <nathans@sgi.com>

[XFS] Use the right offset when ensuring a delayed allocate conversion has covered the offset originally requested. Can cause data corruption when multiple processes are performing writeout on different areas of the same file. Quite difficult to hit though.

SGI Modid: xfs-linux:xfs-kern:22377a

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Christoph Hellwig <hch@sgi.com>
.


# 1f443ad7 05-May-2005 Nathan Scott <nathans@sgi.com>

[XFS] Allow initial XFS delayed allocation size to be increased beyond 64KB.

SGI Modid: xfs-linux:xfs-kern:22261a

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Christoph Hellwig <hch@sgi.com>


# 1da177e4 16-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org>

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!