History log of /linux-master/fs/xfs/xfs_reflink.c
Revision Date Author Comments
# 52f80706 22-Feb-2024 Darrick J. Wong <djwong@kernel.org>

xfs: support deferred bmap updates on the attr fork

The deferred bmap update log item has always supported the attr fork, so
plumb this in so that higher layers can access this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 1196f3f5 22-Feb-2024 Darrick J. Wong <djwong@kernel.org>

xfs: report block map corruption errors to the health tracking system

Whenever we encounter a corrupt block mapping, we should report that to
the health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 3fed24ff 19-Feb-2024 Matthew Wilcox (Oracle) <willy@infradead.org>

xfs: Replace xfs_isilocked with xfs_assert_ilocked

To use the new rwsem_assert_held()/rwsem_assert_held_write(), we can't
use the existing ASSERT macro. Add a new xfs_assert_ilocked() and
convert all the callers.

Fix an apparent bug in xfs_isilocked(): If the caller specifies
XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL, xfs_assert_ilocked() will check both
the IOLOCK and the ILOCK are held for write. xfs_isilocked() only
checked that the ILOCK was held for write.

xfs_assert_ilocked() is always on, even if DEBUG or XFS_WARN aren't
defined. It's a cheap check, so I don't think it's worth defining
it away.
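
Illustration only, not part of the commit: a minimal C sketch of how a
combined-flag lock check can silently test just one lock, and what checking
every requested lock looks like. The struct, flag values and function names
(all prefixed sk_) are hypothetical and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical flag values for this sketch; not the kernel's. */
#define SK_IOLOCK_EXCL	(1u << 0)
#define SK_ILOCK_EXCL	(1u << 2)

/* Toy inode: each field records whether that lock is held for write. */
struct sk_inode {
	bool iolock_write_held;
	bool ilock_write_held;
};

/* Buggy shape: returns after testing the first matching flag only. */
static bool sk_islocked_first_only(const struct sk_inode *ip,
				   unsigned int flags)
{
	if (flags & SK_ILOCK_EXCL)
		return ip->ilock_write_held;
	if (flags & SK_IOLOCK_EXCL)
		return ip->iolock_write_held;
	return false;
}

/* Fixed shape: every lock named in flags must be held for write. */
static bool sk_all_locked(const struct sk_inode *ip, unsigned int flags)
{
	if ((flags & SK_ILOCK_EXCL) && !ip->ilock_write_held)
		return false;
	if ((flags & SK_IOLOCK_EXCL) && !ip->iolock_write_held)
		return false;
	return true;
}
```

With only the ILOCK held, the first helper reports success for
SK_IOLOCK_EXCL | SK_ILOCK_EXCL while the second one does not.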

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>


# 4c88fef3 06-Dec-2023 Darrick J. Wong <djwong@kernel.org>

xfs: remove __xfs_free_extent_later

xfs_free_extent_later is a trivial helper, so remove it to reduce the
amount of thinking required to understand the deferred freeing
interface. This will make it easier to introduce automatic reaping of
speculative allocations in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 55f669f3 16-Oct-2023 Christoph Hellwig <hch@lst.de>

xfs: only remap the written blocks in xfs_reflink_end_cow_extent

xfs_reflink_end_cow_extent looks up the COW extent and the data fork
extent at offset_fsb, and then proceeds to remap the common subset
between the two.

However, it does not limit the remapped extent to the passed-in
[*offset_fsb, end_fsb] range and thus potentially remaps more blocks than
the ones handled by the current I/O completion. This means that with
sufficiently large data and COW extents we could be remapping COW fork
mappings that have not been written to, leading to a stale data exposure
on a powerfail event.

We used to have a xfs_trim_range to make the remap fit the I/O completion
range, but that got (apparently accidentally) removed in commit
df2fd88f8ac7 ("xfs: rewrite xfs_reflink_end_cow to use intents").

Note that I've only found this by code inspection, and a test case would
probably require very specific delay and error injection.
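
Illustration only, not part of the commit: a minimal C sketch of clamping a
mapping to the range covered by one I/O completion, so that blocks outside
that range are never remapped. The struct and the helper name sk_trim_extent
are hypothetical stand-ins, not the kernel's types or functions.

```c
#include <stdint.h>

/* Toy mapping: file offset, disk block and length, all in blocks. */
struct sk_extent {
	uint64_t startoff;	/* first file block covered */
	uint64_t startblock;	/* first disk block */
	uint64_t blockcount;	/* length in blocks */
};

/* Clamp *ext to [bno, bno + len); blockcount becomes 0 if no overlap. */
static void sk_trim_extent(struct sk_extent *ext, uint64_t bno, uint64_t len)
{
	uint64_t end = bno + len;
	uint64_t ext_end = ext->startoff + ext->blockcount;

	if (ext_end <= bno || ext->startoff >= end) {
		ext->blockcount = 0;
		return;
	}
	if (ext->startoff < bno) {
		/* Drop the leading blocks that precede the range. */
		uint64_t skip = bno - ext->startoff;

		ext->startoff += skip;
		ext->startblock += skip;
		ext->blockcount -= skip;
	}
	if (ext->startoff + ext->blockcount > end)
		ext->blockcount = end - ext->startoff;
}
```

A 100-block mapping trimmed to a 20-block completion starting at file block
10 leaves exactly those 20 blocks eligible for remapping.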

Fixes: df2fd88f8ac7 ("xfs: rewrite xfs_reflink_end_cow to use intents")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>


# 14a53798 17-Oct-2023 Catherine Hoang <catherine.hoang@oracle.com>

xfs: allow read IO and FICLONE to run concurrently

One of our VM cluster management products needs to snapshot KVM image
files so that they can be restored in case of failure. Snapshotting is
done by redirecting VM disk writes to a sidecar file and using reflink
on the disk image, specifically the FICLONE ioctl as used by
"cp --reflink". Reflink locks the source and destination files while it
operates, which means that reads from the main vm disk image are blocked,
causing the vm to stall. When an image file is heavily fragmented, the
copy process could take several minutes. Some of the vm image files have
50-100 million extent records, and duplicating that much metadata locks
the file for 30 minutes or more. Having activities suspended for such
a long time in a cluster node could result in node eviction.

Clone operations and read IO do not change any data in the source file,
so they should be able to run concurrently. Demote the exclusive locks
taken by FICLONE to shared locks to allow reads while cloning. While a
clone is in progress, writes will take the IOLOCK_EXCL, so they block
until the clone completes.

Link: https://lore.kernel.org/linux-xfs/8911B94D-DD29-4D6E-B5BC-32EAF1866245@oracle.com/
Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>


# b742d7b4 28-Jun-2023 Dave Chinner <dchinner@redhat.com>

xfs: use deferred frees for btree block freeing

Btrees that aren't freespace management trees use the normal extent
allocation and freeing routines for their blocks. Hence when a btree
block is freed, a direct call to xfs_free_extent() is made and the
extent is immediately freed. This puts the entire free space
management btrees under this path, so we are stacking btrees on
btrees in the call stack. The inobt, finobt and refcount btrees
all do this.

However, the bmap btree does not do this - it calls
xfs_free_extent_later() to defer the extent free operation via an
XEFI and hence it gets processed in deferred operation processing
during the commit of the primary transaction (i.e. via intent
chaining).

We need to change xfs_free_extent() to behave in a non-blocking
manner so that we can avoid deadlocks with busy extents near ENOSPC
in transactions that free multiple extents. Inserting or removing a
record from a btree can cause a multi-level tree merge operation and
that will free multiple blocks from the btree in a single
transaction. i.e. we can call xfs_free_extent() multiple times, and
hence the btree manipulation transaction is vulnerable to this busy
extent deadlock vector.

To fix this, convert all the remaining callers of xfs_free_extent()
to use xfs_free_extent_later() to queue XEFIs and hence defer
processing of the extent frees to a context that can be safely
restarted if a deadlock condition is detected.
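
Illustration only, not part of the commit: a minimal C sketch of the
"queue now, free later" pattern described above. The toy transaction, the
fixed-size pending array and the sk_ names are assumptions made for this
sketch; the kernel's intent machinery is considerably more involved.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_PENDING	8

struct sk_free_item {
	uint64_t start;
	uint64_t len;
};

/* Toy transaction carrying a list of frees to run at commit time. */
struct sk_tx {
	struct sk_free_item pending[SK_MAX_PENDING];
	size_t npending;
	uint64_t blocks_freed;
};

/* Queue a free instead of performing it immediately. */
static int sk_free_extent_later(struct sk_tx *tx, uint64_t start, uint64_t len)
{
	if (tx->npending == SK_MAX_PENDING)
		return -1;
	tx->pending[tx->npending].start = start;
	tx->pending[tx->npending].len = len;
	tx->npending++;
	return 0;
}

/* Process every queued free in one place, after the main update. */
static void sk_commit(struct sk_tx *tx)
{
	for (size_t i = 0; i < tx->npending; i++)
		tx->blocks_freed += tx->pending[i].len;
	tx->npending = 0;
}
```

Nothing is freed while the btree update runs; all frees happen in
sk_commit(), which is the point at which a real implementation could
restart the work if it detected a deadlock condition.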

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>


# 7dfee17b 04-Jun-2023 Dave Chinner <dchinner@redhat.com>

xfs: validate block number being freed before adding to xefi

Bad things happen in deferred extent freeing operations if they are
passed a bad block number in the xefi. This can come from a bogus
agno/agbno pair from deferred agfl freeing, or just a bad fsbno
being passed to __xfs_free_extent_later(). Either way, it's very
difficult to diagnose where a null perag oops in EFI creation
is coming from when the operation that queued the xefi has already
been completed and there's no longer any trace of it around....
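
Illustration only, not part of the commit: a minimal C sketch of validating
an extent at queue time, so that a bad block number is rejected while the
caller is still on the stack. The geometry struct and the sk_ helpers are
hypothetical; the kernel uses its own verification routines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy filesystem geometry: number of AGs and blocks per AG. */
struct sk_geom {
	uint32_t agcount;
	uint32_t agblocks;
};

/* True if [agbno, agbno + len) lies entirely inside an existing AG. */
static bool sk_extent_valid(const struct sk_geom *g, uint32_t agno,
			    uint32_t agbno, uint32_t len)
{
	if (agno >= g->agcount)
		return false;
	if (len == 0 || agbno >= g->agblocks)
		return false;
	return len <= g->agblocks - agbno;
}

/* Queue a free only after validation; returns 0 if queued, -1 if rejected. */
static int sk_queue_free(const struct sk_geom *g, uint32_t agno,
			 uint32_t agbno, uint32_t len, unsigned int *nqueued)
{
	if (!sk_extent_valid(g, agno, agbno, len))
		return -1;
	(*nqueued)++;
	return 0;
}
```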

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# c4d5660a 12-Feb-2023 Dave Chinner <dchinner@redhat.com>

xfs: active perag reference counting

We need to be able to dynamically remove instantiated AGs from
memory safely, either for shrinking the filesystem or paging AG
state in and out of memory (e.g. supporting millions of AGs). This
means we need to be able to safely exclude operations from accessing
perags while dynamic removal is in progress.

To do this, introduce the concept of active and passive references.
Active references are required for high level operations that make
use of an AG for a given operation (e.g. allocation) and pin the
perag in memory for the duration of the operation that is operating
on the perag (e.g. transaction scope). This means we can fail to get
an active reference to an AG, hence callers of the new active
reference API must be able to handle lookup failure gracefully.

Passive references are used in low level code, where we might need
to access the perag structure for the purposes of completing high
level operations. For example, buffers need to use passive
references because:
- we need to be able to do metadata IO during operations like grow
and shrink transactions where high level active references to the
AG have already been blocked
- buffers need to pin the perag until they are reclaimed from
memory, something that high level code has no direct control over.
- unused cached buffers should not prevent a shrink from being
started.

Hence we have active references that will form exclusion barriers
for operations to be performed on an AG, and passive references that
will prevent reclaim of the perag until all objects with passive
references have been reclaimed themselves.

This patch introduces xfs_perag_grab()/xfs_perag_rele() as the API
for active AG reference functionality. We also need to convert the
for_each_perag*() iterators to use active references, which will
start the process of converting high level code over to using active
references. Conversion of non-iterator based code to active
references will be done in followup patches.

Note that the implementation using reference counting is really just
a development vehicle for the API to ensure we don't have any leaks
in the callers. Once we need to remove perag structures from memory
dynamically, we will need a much more robust per-ag state transition
mechanism for preventing new references from being taken while we
wait for existing references to drain before removal from memory can
occur....
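
Illustration only, not part of the commit: a minimal C11 sketch of the
difference between an active reference that can fail to be taken and a
passive reference that always succeeds. The struct layout, the initial
count of one, and the sk_ names are assumptions made for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sk_perag {
	atomic_int active_ref;	/* gates high level operations */
	atomic_int passive_ref;	/* pins the structure in memory */
};

static void sk_perag_init(struct sk_perag *pag)
{
	atomic_init(&pag->active_ref, 1);	/* reference held by the owner */
	atomic_init(&pag->passive_ref, 0);
}

/* Take an active reference unless the count has already dropped to zero. */
static bool sk_perag_grab(struct sk_perag *pag)
{
	int old = atomic_load(&pag->active_ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&pag->active_ref, &old,
						 old + 1))
			return true;
	}
	return false;
}

static void sk_perag_rele(struct sk_perag *pag)
{
	atomic_fetch_sub(&pag->active_ref, 1);
}

/* Passive references never fail; they only delay reclaim. */
static void sk_perag_get(struct sk_perag *pag)
{
	atomic_fetch_add(&pag->passive_ref, 1);
}

static void sk_perag_put(struct sk_perag *pag)
{
	atomic_fetch_sub(&pag->passive_ref, 1);
}
```

Once the owner drops its initial active reference, sk_perag_grab() fails
and callers have to handle that, while sk_perag_get() keeps working for
low level users such as cached buffers.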

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 692b6cdd 10-Feb-2023 Dave Chinner <dchinner@redhat.com>

xfs: t_firstblock is tracking AGs not blocks

The tp->t_firstblock field is now really tracking the highest AG we
have locked, not the block number of the highest allocation we've
made. Its purpose is to prevent AGF locking deadlocks, so rename it
to "highest AG" and simplify the implementation to just track the
agno rather than a fsbno.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 26870c3f 26-Dec-2022 Darrick J. Wong <djwong@kernel.org>

xfs: don't assert if cmap covers imap after cycling lock

In xfs_reflink_fill_cow_hole, there's a debugging assertion that trips
if (after cycling the ILOCK to get a transaction) the requeried cow
mapping overlaps the start of the area being written. IOWs, it trips if
the hole in the cow fork that it's supposed to fill has been filled.

This is trivially possible since we cycled ILOCK_EXCL. If we trip the
assertion, then we know that cmap is a delalloc extent because @found is
false. Fortunately, the bmapi_write call below will convert the
delalloc extent to a real unwritten cow fork extent, so all we need to
do here is remove the assertion.

It turns out that generic/095 trips this pretty regularly with alwayscow
mode enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# d984648e 01-Dec-2022 Shiyang Ruan <ruansy.fnst@fujitsu.com>

fsdax,xfs: port unshare to fsdax

Implement unshare in fsdax mode: copy data from srcmap to iomap.

Link: https://lkml.kernel.org/r/1669908753-169-1-git-send-email-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# a0ebf8c4 18-Sep-2022 Zeng Heng <zengheng4@huawei.com>

xfs: simplify if-else condition in xfs_reflink_trim_around_shared

"else" is not generally useful after a return,
so remove it for clean code.
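
Illustration only, not part of the commit: a tiny C sketch of the cleanup
pattern. Both functions are hypothetical and behave identically; the second
simply drops the redundant "else".

```c
/* Before: the else branch is redundant because the if branch returns. */
static int sk_sign_with_else(int v)
{
	if (v < 0)
		return -1;
	else
		return v > 0;
}

/* After: same behavior, one less level of nesting. */
static int sk_sign(int v)
{
	if (v < 0)
		return -1;
	return v > 0;
}
```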

There are no logical changes.

Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# d6211330 04-Aug-2022 Chandan Babu R <chandan.babu@oracle.com>

xfs: Fix false ENOSPC when performing direct write on a delalloc extent in cow fork

On a highly fragmented filesystem, a Direct IO write can fail with an -ENOSPC error
even though the filesystem has a sufficient number of free blocks.

This occurs if the file offset range on which the write operation is being
performed has a delalloc extent in the cow fork and this delalloc extent
begins much before the Direct IO range.

In such a scenario, xfs_reflink_allocate_cow() invokes xfs_bmapi_write() to
allocate the blocks mapped by the delalloc extent. The extent thus allocated
may not cover the beginning of file offset range on which the Direct IO write
was issued. Hence xfs_reflink_allocate_cow() ends up returning -ENOSPC.

The following script reliably recreates the bug described above.

#!/usr/bin/bash

device=/dev/loop0
shortdev=$(basename $device)

mntpnt=/mnt/
file1=${mntpnt}/file1
file2=${mntpnt}/file2
fragmentedfile=${mntpnt}/fragmentedfile
punchprog=/root/repos/xfstests-dev/src/punch-alternating

errortag=/sys/fs/xfs/${shortdev}/errortag/bmap_alloc_minlen_extent

umount $device > /dev/null 2>&1

echo "Create FS"
mkfs.xfs -f -m reflink=1 $device > /dev/null 2>&1
if [[ $? != 0 ]]; then
	echo "mkfs failed."
	exit 1
fi

echo "Mount FS"
mount $device $mntpnt > /dev/null 2>&1
if [[ $? != 0 ]]; then
	echo "mount failed."
	exit 1
fi

echo "Create source file"
xfs_io -f -c "pwrite 0 32M" $file1 > /dev/null 2>&1

sync

echo "Create Reflinked file"
xfs_io -f -c "reflink $file1" $file2 &>/dev/null

echo "Set cowextsize"
xfs_io -c "cowextsize 16M" $file1 > /dev/null 2>&1

echo "Fragment FS"
xfs_io -f -c "pwrite 0 64M" $fragmentedfile > /dev/null 2>&1
sync
$punchprog $fragmentedfile

echo "Allocate block sized extent from now onwards"
echo -n 1 > $errortag

echo "Create 16MiB delalloc extent in CoW fork"
xfs_io -c "pwrite 0 4k" $file1 > /dev/null 2>&1

sync

echo "Direct I/O write at offset 12k"
xfs_io -d -c "pwrite 12k 8k" $file1

This commit fixes the bug by invoking xfs_bmapi_write() in a loop until disk
blocks are allocated for at least the starting file offset of the Direct IO
write range.
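
Illustration only, not part of the commit: a minimal C sketch of the
"allocate in a loop until the target offset is covered" idea. The toy
allocator converts a delalloc extent from its front in chunks of at most
max_chunk blocks, which mimics the minimum-length allocations forced by the
error tag in the script above. All names are hypothetical.

```c
#include <stdint.h>

struct sk_alloc_state {
	uint64_t next_off;	/* first file block still delalloc */
	uint64_t end_off;	/* end of the delalloc extent (exclusive) */
	uint64_t max_chunk;	/* most blocks one call may convert */
};

/* Convert up to max_chunk blocks from the front; 0 means nothing left. */
static uint64_t sk_alloc_chunk(struct sk_alloc_state *s)
{
	uint64_t left = s->end_off - s->next_off;
	uint64_t got = left < s->max_chunk ? left : s->max_chunk;

	s->next_off += got;
	return got;
}

/* Keep allocating until file block target is covered; -1 if we run out. */
static int sk_alloc_until_covered(struct sk_alloc_state *s, uint64_t target,
				  unsigned int *calls)
{
	while (s->next_off <= target) {
		if (sk_alloc_chunk(s) == 0)
			return -1;
		(*calls)++;
	}
	return 0;
}
```

With one-block chunks and a write starting at file block 3, a single
allocation call covers only block 0, so giving up after that first call
would be a false failure; the loop needs four calls to reach the target.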

Fixes: 3c68d44a2b49 ("xfs: allocate direct I/O COW blocks in iomap_begin")
Reported-and-Root-caused-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: slight editing to make the locking less grody, and fix some style things]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 13f9e267 02-Jun-2022 Shiyang Ruan <ruansy.fnst@fujitsu.com>

xfs: add dax dedupe support

Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files that
are going to be deduped. After that, call the compare range function only
when both files are DAX or both are not.

Link: https://lkml.kernel.org/r/20220603053738.1218681-15-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 6f7db389 02-Jun-2022 Shiyang Ruan <ruansy.fnst@fujitsu.com>

fsdax: dedup file range to use a compare function

With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar to vfs_dedupe_file_range_compare().
And introduce dax_remap_file_range_prep() for filesystem use.

Link: https://lkml.kernel.org/r/20220603053738.1218681-13-ruansy.fnst@fujitsu.com
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 732436ef 09-Jul-2022 Darrick J. Wong <djwong@kernel.org>

xfs: convert XFS_IFORK_PTR to a static inline helper

We're about to make this logic do a bit more, so convert the macro to a
static inline function for better typechecking and fewer shouty macros.
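
Illustration only, not part of the commit: a minimal C sketch of what a
fork-pointer lookup can look like as a static inline function, where the
compiler checks the argument types. The enum values, structs and the
sk_ifork_ptr name are hypothetical and do not mirror the kernel's layout.

```c
#include <stddef.h>

enum { SK_DATA_FORK, SK_ATTR_FORK, SK_COW_FORK };

struct sk_fork {
	int nextents;
};

struct sk_inode2 {
	struct sk_fork df;	/* data fork, embedded */
	struct sk_fork *afp;	/* attr fork, may be NULL */
	struct sk_fork *cowfp;	/* cow fork, may be NULL */
};

/* Typed replacement for a macro: wrong argument types fail to compile. */
static inline struct sk_fork *sk_ifork_ptr(struct sk_inode2 *ip, int whichfork)
{
	switch (whichfork) {
	case SK_DATA_FORK:
		return &ip->df;
	case SK_ATTR_FORK:
		return ip->afp;
	case SK_COW_FORK:
		return ip->cowfp;
	default:
		return NULL;
	}
}
```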
No functional changes here.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 08d3e84f 07-Jul-2022 Dave Chinner <dchinner@redhat.com>

xfs: pass perag to xfs_alloc_read_agf()

xfs_alloc_read_agf() initialises the perag if it hasn't been done
yet, so it makes sense to pass it the perag rather than pull a
reference from the buffer. This allows callers to be per-ag centric
rather than passing mount/agno pairs everywhere.

Whilst modifying the xfs_reflink_find_shared() function definition,
declare it static and remove the extern declaration as it is an
internal function only these days.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# df2fd88f 25-Apr-2022 Darrick J. Wong <djwong@kernel.org>

xfs: rewrite xfs_reflink_end_cow to use intents

Currently, the code that performs CoW remapping after a write has this
odd behavior where it walks /backwards/ through the data fork to remap
extents in reverse order. Earlier, we rewrote the reflink remap
function to use deferred bmap log items instead of trying to cram as
much into the first transaction that we could. Now do the same for the
CoW remap code. There doesn't seem to be any performance impact; we're
just making better use of code that we added for the benefit of reflink.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# f1e6a8d7 25-Apr-2022 Darrick J. Wong <djwong@kernel.org>

xfs: remove a __xfs_bunmapi call from reflink

This raw call isn't necessary since we can always remove a full delalloc
extent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 4f86bb4b 09-Mar-2022 Chandan Babu R <chandan.babu@oracle.com>

xfs: Conditionally upgrade existing inodes to use large extent counters

This commit enables upgrading existing inodes to use large extent counters
provided that underlying filesystem's superblock has large extent counter
feature enabled.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>


# 1a39ae41 25-Feb-2022 Gao Xiang <hsiangkao@linux.alibaba.com>

xfs: add missing cmap->br_state = XFS_EXT_NORM update

COW extents are already converted into written real extents after
xfs_reflink_convert_cow_locked(), therefore cmap->br_state should
reflect it.

Otherwise, there is another necessary unwritten conversion
triggered in xfs_dio_write_end_io() for direct I/O cases.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# f1ba5faf 29-Nov-2021 Shiyang Ruan <ruansy.fnst@fujitsu.com>

xfs: add xfs_zero_range and xfs_truncate_page helpers

Add helpers to prepare for using different DAX operations.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
[hch: split from a larger patch + slight cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-16-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>


# 7993f1a4 15-Dec-2021 Darrick J. Wong <djwong@kernel.org>

xfs: only run COW extent recovery when there are no live extents

As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose. Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem> # (0)
fsstress <run only readonly ops> & # (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro # (2)
	fsstress <run only readonly ops>
	mount -o remount,rw # (3)
done

When (2) happens, notice that (1) is still running. xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) are now inconsistent with the ondisk metadata. If
one of those former COW extents is allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt. In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork. In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.

As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them. This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging extents that may have been left behind
during an unclean shutdown. As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.

Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time. While we're at it, move
the callsite to xfs_log_mount_finish because any refcount btree
expansion (however unlikely given that we're removing records from the
right side of the index) must be fed by a per-AG reservation, which
doesn't exist in its current location.

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# c201d9ca 12-Oct-2021 Darrick J. Wong <djwong@kernel.org>

xfs: rename xfs_bmap_add_free to xfs_free_extent_later

xfs_bmap_add_free isn't a block mapping function; it schedules deferred
freeing operations for a later point in a compound transaction chain.
While it's primarily used by bunmapi, its use has expanded beyond that.
Move it to xfs_alloc.c and rename the function since it's now general
freeing functionality. Bring the slab cache bits in line with the
way we handle the other intent items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>


# 38c26bfd 18-Aug-2021 Dave Chinner <dchinner@redhat.com>

xfs: replace xfs_sb_version checks with feature flag checks

Convert the xfs_sb_version_hasfoo() to checks against
mp->m_features. Checks of the superblock itself during disk
operations (e.g. in the read/write verifiers and the to/from disk
formatters) are not converted - they operate purely on the
superblock state. Everything else should use the mount features.

Large parts of this conversion were done with sed with commands like
this:

for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
	sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
done

With manual cleanups for things like "xfs_has_extflgbit" and other
little inconsistencies in naming.

The result is a lot less typing to check features and an XFS binary
size reduced by a bit over 3kB:

$ size -t fs/xfs/built-in.a
          text    data     bss     dec     hex filename
before 1130866  311352     484 1442702  16038e (TOTALS)
after  1127727  311352     484 1439563  15f74b (TOTALS)

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# a81a0621 01-Jun-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert refcount btree cursor to use perags

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# be9fb17d 01-Jun-2021 Dave Chinner <dchinner@redhat.com>

xfs: add a perag to the btree cursor

Which will eventually completely replace the agno in it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 934933c3 01-Jun-2021 Dave Chinner <dchinner@redhat.com>

xfs: convert raw ag walks to use for_each_perag

Convert the raw walks to an iterator, pulling the current AG out of
pag->pag_agno instead of the loop iterator variable.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 9bbafc71 01-Jun-2021 Dave Chinner <dchinner@redhat.com>

xfs: move xfs_perag_get/put to xfs_ag.[ch]

They are AG functions, not superblock functions, so move them to the
appropriate location.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# d4f74e16 28-Apr-2021 Darrick J. Wong <djwong@kernel.org>

xfs: fix xfs_reflink_unshare usage of filemap_write_and_wait_range

The final parameter of filemap_write_and_wait_range is the end of the
range to flush, not the length of the range to flush.
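
Illustration only, not part of the commit: a minimal C sketch of the
end-versus-length mix-up. sk_flush_range() is a hypothetical stand-in that
takes an inclusive start and end byte, which is the convention the message
above describes; it is not the kernel function.

```c
#include <stdint.h>

/* Last byte (inclusive) of a range of len bytes at offset; len must be > 0. */
static int64_t sk_range_last_byte(int64_t offset, int64_t len)
{
	return offset + len - 1;
}

/* Stand-in flush: returns how many bytes [start, end] actually covers. */
static int64_t sk_flush_range(int64_t start, int64_t end)
{
	if (end < start)
		return 0;
	return end - start + 1;
}
```

Passing a length where an end is expected, for example
sk_flush_range(8192, 4096), covers nothing at all, whereas
sk_flush_range(8192, sk_range_last_byte(8192, 4096)) covers the intended
4096 bytes.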

Fixes: 46afb0628b86 ("xfs: only flush the unshared range in xfs_reflink_unshare")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 862a804a 13-Apr-2021 Christoph Hellwig <hch@lst.de>

xfs: move the XFS_IFEXTENTS check into xfs_iread_extents

Move the XFS_IFEXTENTS check from the callers into xfs_iread_extents to
simplify the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 3e09ab8f 29-Mar-2021 Christoph Hellwig <hch@lst.de>

xfs: move the di_flags2 field to struct xfs_inode

In preparation of removing the historic icinode struct, move the flags2
field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# b33ce57d 29-Mar-2021 Christoph Hellwig <hch@lst.de>

xfs: move the di_cowextsize field to struct xfs_inode

In preparation of removing the historic icinode struct, move the
cowextsize field into the containing xfs_inode structure. Also
switch to use the xfs_extlen_t instead of a uint32_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 13d2c10b 29-Mar-2021 Christoph Hellwig <hch@lst.de>

xfs: move the di_size field to struct xfs_inode

In preparation of removing the historic icinode struct, move the on-disk
size field into the containing xfs_inode structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 766aabd5 22-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: flush eof/cowblocks if we can't reserve quota for file blocks

If a fs modification (data write, reflink, xattr set, fallocate, etc.)
is unable to reserve enough quota to handle the modification, try
clearing whatever space the filesystem might have been hanging onto in
the hopes of speeding up the filesystem. The flushing behavior will
become particularly important when we add deferred inode inactivation
because that will increase the amount of space that isn't actively tied
to user data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 4ca74205 27-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: try worst case space reservation upfront in xfs_reflink_remap_extent

Now that we've converted xfs_reflink_remap_extent to use the new
xfs_trans_alloc_inode API, we can focus on its slightly unusual behavior
with regard to quota reservations.

Since it's valid to remap written blocks into a hole, we must be able to
increase the quota count by the number of blocks in the mapping.
However, the incore space reservation process requires us to supply an
asymptotic guess before we can gain exclusive access to resources. We'd
like to reserve all the quota we need up front, but we also don't want
to fail a written -> allocated remap operation unnecessarily.

The solution is to make the remap_extents function call the transaction
allocation function twice. The first time we ask to reserve enough
space and quota to handle the absolute worst case situation, but if that
fails, we can fall back to the old strategy: ask for the bare minimum
space reservation upfront and increase the quota reservation later if we
need to.

Later in this patchset we change the transaction and quota code to try
to reclaim space if we cannot reserve free space or quota.
Restructuring the remap_extent function in this manner means that if the
fallback increase fails, we can pass that back to the caller knowing
that the transaction allocation already tried freeing space.
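
Illustration only, not part of the commit: a minimal C sketch of the
"try the worst case first, then fall back to the minimum" reservation
strategy. The toy pool and the sk_ names are hypothetical; in the fallback
case a real caller would still have to raise its quota reservation later
if the operation turns out to need it.

```c
#include <stdint.h>

/* Toy resource pool standing in for free space or quota. */
struct sk_pool {
	uint64_t free_blocks;
};

static int sk_reserve(struct sk_pool *p, uint64_t n)
{
	if (n > p->free_blocks)
		return -1;
	p->free_blocks -= n;
	return 0;
}

/*
 * First ask for the worst case; if that fails, ask for the bare minimum.
 * *got_worst tells the caller which reservation it ended up with.
 */
static int sk_reserve_for_remap(struct sk_pool *p, uint64_t min_blocks,
				uint64_t worst_blocks, int *got_worst)
{
	if (sk_reserve(p, worst_blocks) == 0) {
		*got_worst = 1;
		return 0;
	}
	*got_worst = 0;
	return sk_reserve(p, min_blocks);
}
```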

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# f273387b 27-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: refactor reflink functions to use xfs_trans_alloc_inode

The two remaining callers of xfs_trans_reserve_quota_nblks are in the
reflink code. These conversions aren't as uniform as the previous
conversions, so call that out in a separate patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 02b7ee4e 26-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: reserve data and rt quota at the same time

Modify xfs_trans_reserve_quota_nblks so that we can reserve data and
realtime blocks from the dquot at the same time. This change has the
theoretical side effect that for allocations to realtime files we will
reserve from the dquot both the number of rtblocks being allocated and
the number of bmbt blocks that might be needed to add the mapping.
However, since the mount code disables quota if it finds a realtime
device, this should not result in any behavior changes.

Now that we've moved the inode creation callers away from using the
_nblks function, we can repurpose the (now unused) ninos argument for
realtime blocks, so make that change. This also replaces the flags
argument with a boolean parameter to force the reservation since we
don't need to distinguish between data and rt quota reservations any
more, and the only flag being passed in was FORCE_RES.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 35b11010 26-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: remove xfs_trans_unreserve_quota_nblks completely

xfs_trans_cancel will release all the quota resources that were reserved
on behalf of the transaction, so get rid of the explicit unreserve step.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 85546500 22-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: create convenience wrappers for incore quota block reservations

Create a couple of convenience wrappers for creating and deleting quota
block reservations against future changes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 4abe21ad 22-Jan-2021 Darrick J. Wong <djwong@kernel.org>

xfs: clean up quota reservation callsites

Convert a few xfs_trans_*reserve* callsites that are open-coding other
convenience functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# ee898d78 22-Jan-2021 Chandan Babu R <chandanrlinux@gmail.com>

xfs: Check for extent overflow when remapping an extent

Remapping an extent involves unmapping the existing extent and mapping
in the new extent. When unmapping, an extent containing the entire unmap
range can be split into two extents,
i.e. | Old extent | hole | Old extent |
Hence extent count increases by 1.

Mapping in the new extent into the destination file can increase the
extent count by 1.
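
As a rough illustration of the unmap arithmetic above, the following sketch computes how the extent count changes when a range is unmapped from a single extent that contains it. `struct extent` and `unmap_extent_delta` are made-up names for this example, not XFS types or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative extent covering file blocks [start, start + count). */
struct extent {
	uint64_t start;
	uint64_t count;
};

/*
 * Change in extent count when [u_start, u_start + u_count) is unmapped
 * from a single extent that fully contains that range.
 */
static int unmap_extent_delta(struct extent e, uint64_t u_start,
			      uint64_t u_count)
{
	uint64_t e_end = e.start + e.count;
	uint64_t u_end = u_start + u_count;

	assert(u_count > 0 && u_start >= e.start && u_end <= e_end);

	if (u_start == e.start && u_end == e_end)
		return -1;	/* whole extent removed */
	if (u_start == e.start || u_end == e_end)
		return 0;	/* extent shortened at one end */
	return 1;		/* split: | old | hole | old | */
}
```

Only the strictly interior case grows the count, which is the worst case the overflow check has to budget for.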

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 5f1d5bbf 22-Jan-2021 Chandan Babu R <chandanrlinux@gmail.com>

xfs: Check for extent overflow when moving extent from cow to data fork

Moving an extent to data fork can cause a sub-interval of an existing
extent to be unmapped. This will increase extent count by 1. Mapping in
the new extent can increase the extent count by 1 again i.e.
| Old extent | New extent | Old extent |
Hence number of extents increases by 2.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 46afb062 02-Nov-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: only flush the unshared range in xfs_reflink_unshare

There's no reason to flush an entire file when we're unsharing part of
a file. Therefore, only initiate writeback on the selected range.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>


# b63da6c8 05-Aug-2020 Randy Dunlap <rdunlap@infradead.org>

xfs: delete duplicated words + other fixes

Delete repeated words in fs/xfs/.
{we, that, the, a, to, fork}
Change "it it" to "it is" in one location.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
To: linux-fsdevel@vger.kernel.org
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# e2aaee9c 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: move helpers that lock and unlock two inodes against userspace IO

Move the double-inode locking helpers to xfs_inode.c since they're not
specific to reflink.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 10b4bd6c 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: refactor locking and unlocking two inodes against userspace IO

Refactor the two functions that we use to lock and unlock two inodes to
block userspace from initiating IO against a file, whether via system
calls or mmap activity.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 451d34ee 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix xfs_reflink_remap_prep calling conventions

Fix the return value of xfs_reflink_remap_prep so that its return value
conventions match the rest of xfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 168eae80 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: reflink can skip remap existing mappings

If the source and destination map are identical, we can skip the remap
step to save some time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 94b941fd 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: only reserve quota blocks if we're mapping into a hole

When logging quota block count updates during a reflink operation, we
only log the /delta/ of the block count changes to the dquot. Since we
now know ahead of time the extent type of both dmap and smap (and that
they have the same length), we know that we only need to reserve quota
blocks for dmap's blockcount if we're mapping it into a hole.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# aa5d0ba0 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: only reserve quota blocks for bmbt changes if we're changing the data fork

Now that we've reworked xfs_reflink_remap_extent to remap only one
extent per transaction, we actually know if the extent being removed is
an allocated mapping. This means that we now know ahead of time if
we're going to be touching the data fork.

Since we only need blocks for a bmbt split if we're going to update the
data fork, we only need to get quota reservation if we know we're going
to touch the data fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 00fd1d56 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: redesign the reflink remap loop to fix blkres depletion crash

The existing reflink remapping loop has some structural problems that
need addressing:

The biggest problem is that we create one transaction for each extent in
the source file without accounting for the number of mappings there are
for the same range in the destination file. In other words, we don't
know the number of remap operations that will be necessary and we
therefore cannot guess the block reservation required. On highly
fragmented filesystems (e.g. ones with active dedupe) we guess wrong,
run out of block reservation, and fail.

The second problem is that we don't actually use the bmap intents to
their full potential -- instead of calling bunmapi directly and having
to deal with its backwards operation, we could call the deferred ops
xfs_bmap_unmap_extent and xfs_refcount_decrease_extent instead. This
makes the frontend loop much simpler.

Solve all of these problems by refactoring the remapping loops so that
we only perform one remapping operation per transaction, and each
operation only tries to remap a single extent from source to dest.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reported-by: Edwin Török <edwin@etorok.net>
Tested-by: Edwin Török <edwin@etorok.net>


# 877f58f5 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: rename xfs_bmap_is_real_extent to is_written_extent

The name of this predicate is a little misleading -- it decides if the
extent mapping is allocated and written. Change the name to be more
direct, as we're going to add a new predicate in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 83895227 29-Jun-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix reflink quota reservation accounting error

Quota reservations are supposed to account for the blocks that might be
allocated due to a bmap btree split. Reflink doesn't do this, so fix
this to make the quota accounting more accurate before we start
rearranging things.

Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# c142932c 12-Apr-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix partially uninitialized structure in xfs_reflink_remap_extent

In the reflink extent remap function, it turns out that uirec (the block
mapping corresponding only to the part of the passed-in mapping that got
unmapped) was not fully initialized. Specifically, br_state was not
being copied from the passed-in struct to the uirec. This could lead to
unpredictable results such as the reflinked mapping being marked
unwritten in the destination file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 706b8c5b 23-Jan-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove unnecessary null pointer checks from _read_agf callers

Drop the null buffer pointer checks in all code that calls
xfs_alloc_read_agf and doesn't pass XFS_ALLOC_FLAG_TRYLOCK because
they're no longer necessary.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# aa124436 20-Jan-2020 zhengbin <zhengbin13@huawei.com>

xfs: change return value of xfs_inode_need_cow to int

Fixes coccicheck warning:

fs/xfs/xfs_reflink.c:236:9-10: WARNING: return of 0/1 in function 'xfs_inode_need_cow' with return type bool

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
[darrick: rename the function so it doesn't sound like a predicate]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# a5084865 02-Jan-2020 Darrick J. Wong <darrick.wong@oracle.com>

xfs: introduce XFS_MAX_FILEOFF

Introduce a new #define for the maximum supported file block offset.
We'll use this in the next patch to make it more obvious that we're
doing some operation for all possible inode fork mappings after a given
offset. We can't use ULLONG_MAX here because bunmapi uses that to
detect when it's done.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# da781e64 21-Oct-2019 Brian Foster <bfoster@redhat.com>

xfs: don't set bmapi total block req where minleft is

xfs_bmapi_write() takes a total block requirement parameter that is
passed down to the block allocation code and is used to specify the
total block requirement of the associated transaction. This is used
to try and select an AG that can not only satisfy the requested
extent allocation, but can also accommodate subsequent allocations
that might be required to complete the transaction. For example,
additional bmbt block allocations may be required on insertion of
the resulting extent to an inode data fork.

While it's important for callers to calculate and reserve such extra
blocks in the transaction, it is not necessary to pass the total
value to xfs_bmapi_write() in all cases. The latter automatically
sets minleft to ensure that sufficient free blocks remain after the
allocation attempt to expand the format of the associated inode
(i.e., such as extent to btree conversion, btree splits, etc).
Therefore, any callers that pass a total block requirement of the
bmap mapping length plus worst case bmbt expansion essentially
specify the additional reservation requirement twice. These callers
can pass a total of zero to rely on the bmapi minleft policy.

Beyond being superfluous, the primary motivation for this change is
that the total reservation logic in the bmbt code is dubious in
scenarios where minlen < maxlen and a maxlen extent cannot be
allocated (which is more common for data extent allocations where
contiguity is not required). The total value is based on maxlen in
the xfs_bmapi_write() caller. If the bmbt code falls back to an
allocation between minlen and maxlen, that allocation will not
succeed until total is reset to minlen, which essentially throws
away any additional reservation included in total by the caller. In
addition, the total value is not reset until after alignment is
dropped, which means that such callers drop alignment far more
aggressively than necessary.

Update all callers of xfs_bmapi_write() that pass a total block
value of the mapping length plus bmbt reservation to instead pass
zero and rely on xfs_bmapi_minleft() to enforce the bmbt reservation
requirement. This trades off slightly less conservative AG selection
for the ability to preserve alignment in more scenarios.
xfs_bmapi_write() callers that incorporate unrelated or additional
reservations in total beyond what is already included in minleft
must continue to use the former.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# f150b423 19-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: split the iomap ops for buffered vs direct writes

Instead of lots of magic conditionals in the main write_begin
handler this makes the intent very clear. Things will become even
better once we support delayed allocations for extent size hints
and realtime allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# ffb375a8 19-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: pass two imaps to xfs_reflink_allocate_cow

xfs_reflink_allocate_cow consumes the source data fork imap, and
potentially returns the COW fork imap. Split the arguments in two
to clear up the calling conventions and to prepare for returning
a source iomap from ->iomap_begin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# dd26b846 19-Oct-2019 Christoph Hellwig <hch@lst.de>

xfs: remove xfs_reflink_dirty_extents

Now that xfs_file_unshare is not completely dumb we can just call it
directly without iterating the extent and reflink btrees ourselves.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3590c4d8 18-Oct-2019 Christoph Hellwig <hch@lst.de>

iomap: ignore non-shared or non-data blocks in xfs_file_dirty

xfs_file_dirty is used to unshare reflink blocks. Rename the function
to xfs_file_unshare to better document that purpose, and skip iomaps
that are not shared and don't need zeroing. This will allow us to
simplify the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3e08f42a 26-Aug-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove unnecessary int returns from deferred bmap functions

Remove the return value from the functions that schedule deferred bmap
operations since they never fail and do not return status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 74b4c5d4 26-Aug-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove unnecessary int returns from deferred refcount functions

Remove the return value from the functions that schedule deferred
refcount operations since they never fail and do not return status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>


# 5d888b48 14-Aug-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix reflink source file racing with directio writes

While trawling through the dedupe file comparison code trying to fix
page deadlocking problems, Dave Chinner noticed that the reflink code
only takes shared IOLOCK/MMAPLOCKs on the source file. Because
page_mkwrite and directio writes do not take the EXCL versions of those
locks, this means that reflink can race with writer processes.

For pure remapping this can lead to undefined behavior and file
corruption; for dedupe this means that we cannot be sure that the
contents are identical when we decide to go ahead with the remapping.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 73d30d48 28-Jun-2019 Christoph Hellwig <hch@lst.de>

xfs: remove XFS_TRANS_NOFS

Instead of a magic flag for xfs_trans_alloc, just ensure all callers
that can't reclaim through the file system use memalloc_nofs_save to
set the per-task nofs flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 250d4b4c 28-Jun-2019 Eric Sandeen <sandeen@sandeen.net>

xfs: remove unused header files

There are many, many xfs header files which are included but
unneeded (or included twice) in the xfs code, so remove them.

nb: xfs_linux.h includes about 9 headers for everyone, so those
explicit includes get removed by this. I'm not sure what the
preference is, but if we wanted explicit includes everywhere,
a followup patch could remove those xfs_*.h includes from
xfs_linux.h and move them into the files that need them.
Or it could be left as-is.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# c1a4447f 25-Feb-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix uninitialized error variables

smatch complained about some uninitialized error returns, so fix those.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>


# affe250a 21-Feb-2019 Darrick J. Wong <darrick.wong@oracle.com>

xfs: don't pass iomap flags to xfs_reflink_allocate_cow

Don't pass raw iomap flags to xfs_reflink_allocate_cow; signal our
intention with a boolean argument.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 66ae56a5 18-Feb-2019 Christoph Hellwig <hch@lst.de>

xfs: introduce an always_cow mode

Add a mode where XFS never overwrites existing blocks in place. This
is to aid debugging our COW code, and also put infrastructure in place
for things like possible future support for zoned block devices, which
can't support overwrites.

This mode is enabled globally by doing a:

echo 1 > /sys/fs/xfs/debug/always_cow

Note that the parameter is global to allow running all tests in xfstests
easily in this mode, which would not easily be possible with a per-fs
sysfs file.

In always_cow mode persistent preallocations are disabled, and fallocate
will fail when called with a 0 mode (with or without
FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for zeroed
space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.

There are a few interesting xfstests failures when run in always_cow
mode:

- generic/392 fails because the bytes used in the file used to test
hole punch recovery are less after the log replay. This is
because the blocks written and then punched out are only freed
with a delay due to the logging mechanism.
- xfs/170 will fail as the already fragile file streams mechanism
doesn't seem to interact well with the COW allocator
- xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
the file system is badly fragmented, but there is not much we
can do to avoid that when always writing out of place
- xfs/205 fails because overwriting a file in always_cow mode
will require new space allocation and the assumptions in the
test thus don't work anymore.
- xfs/326 fails to modify the file at all in always_cow mode after
injecting the refcount error, leading to an unexpected md5sum
after the remount, but that again is expected

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 26b91c72 18-Feb-2019 Christoph Hellwig <hch@lst.de>

xfs: make COW fork unwritten extent conversions more robust

If we have racing buffered and direct I/O, COW fork extents under
writeback can have been moved to the data fork by the time we call
xfs_reflink_convert_cow from xfs_submit_ioend. This would be mostly
harmless as the block numbers don't change by this move, except for
the fact that xfs_bmapi_write will crash or trigger asserts when
not finding existing extents, even despite trying to paper over this
with the XFS_BMAPI_CONVERT_ONLY flag.

Instead of special casing non-transaction conversions in the already
way too complicated xfs_bmapi_write just add a new helper for the much
simpler non-transactional COW fork case, which simply ignores not
found extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# db46e604 18-Feb-2019 Christoph Hellwig <hch@lst.de>

xfs: merge COW handling into xfs_file_iomap_begin_delay

Besides simplifying the code a bit this allows us to actually implement
the behavior of using COW preallocation for non-COW data mentioned
in the current comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 78f0cc9d 18-Feb-2019 Christoph Hellwig <hch@lst.de>

xfs: don't use delalloc extents for COW on files with extsize hints

While using delalloc for extsize hints is generally a good idea, the
current code that does so only for COW doesn't help us much and creates
a lot of special cases. Switch it to use real allocations like we
do for direct I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# be225fec 15-Feb-2019 Christoph Hellwig <hch@lst.de>

xfs: remove the io_type field from the writeback context and ioend

The io_type field contains what is basically a summary of information
from the inode fork and the imap. But we can just as easily use that
information directly, simplifying a few bits here and there and
improving the trace points.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# d6f215f3 12-Dec-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: split up the xfs_reflink_end_cow work into smaller transactions

In xfs_reflink_end_cow, we allocate a single transaction for the entire
end_cow operation and then loop the CoW fork mappings to move them to
the data fork. This design fails on a heavily fragmented filesystem
where an inode's data fork has exactly one more extent than would fit in
an extents-format fork, because the unmap can collapse the data fork
into extents format (freeing the bmbt block) but the remap can expand
the data fork back into a (newly allocated) bmbt block. If the number
of extents we end up remapping is large, we can overflow the block
reservation because we reserved blocks assuming that we were adding
mappings into an already-cleared area of the data fork.

Let's say we have 8 extents in the data fork, 8 extents in the CoW fork,
and the data fork can hold at most 7 extents before needing to convert
to btree format; and that blocks A-P are discontiguous single-block
extents:

0......7
D: ABCDEFGH
C: IJKLMNOP

When a write to file blocks 0-7 completes, we must remap I-P into the
data fork. We start by removing H from the btree-format data fork. Now
we have 7 extents, so we convert the fork to extents format, freeing the
bmbt block. We then move P into the data fork and it now has 8 extents
again. We must convert the data fork back to btree format, requiring a
block allocation. If we repeat this sequence for blocks 6-5-4-3-2-1-0,
we'll need a total of 8 block allocations to remap all 8 blocks. We
reserved only enough blocks to handle one btree split (5 blocks on a 4k
block filesystem), which means we overflow the block reservation.
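
The walkthrough above can be checked with a small model. This is a deliberately simplified sketch, not the kernel's fork format logic: `count_bmbt_allocs` is a hypothetical helper that assumes no mappings ever merge (as with the discontiguous single-block extents in the example) and that the fork flips format exactly at the `max_inline` threshold.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count bmbt block allocations when nr_remap mappings are replaced one at
 * a time (unmap the old mapping, then map the new one) in a fork that
 * starts with nr_extents mappings and can hold at most max_inline of them
 * in extents format.
 */
static unsigned int count_bmbt_allocs(unsigned int nr_extents,
				      unsigned int max_inline,
				      unsigned int nr_remap)
{
	bool btree = nr_extents > max_inline;
	unsigned int allocs = 0;
	unsigned int i;

	assert(nr_remap <= nr_extents);

	for (i = 0; i < nr_remap; i++) {
		/* Unmap one old mapping; the fork may collapse to extents format. */
		nr_extents--;
		if (btree && nr_extents <= max_inline)
			btree = false;	/* bmbt block freed */

		/* Map the replacement; the fork may need a bmbt block again. */
		nr_extents++;
		if (!btree && nr_extents > max_inline) {
			btree = true;
			allocs++;	/* new bmbt block allocated */
		}
	}
	return allocs;
}
```

With 8 extents, a limit of 7, and 8 remaps, the model reports 8 allocations, matching the count given above. A fork holding 9 extents never collapses and so needs none in this model.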

To fix this issue, create a separate helper function to remap a single
extent, and change _reflink_end_cow to call it in a tight loop over the
entire range we're completing. As a side effect this also removes the
size restrictions on how many extents we can end_cow at a time, though
nobody ever hit that. It is not reasonable to reserve N blocks to remap
N blocks.

Note that this can be reproduced after ~320 million fsx ops while
running generic/938 (long soak directio fsx exerciser):

XFS: Assertion failed: tp->t_blk_res >= tp->t_blk_res_used, file: fs/xfs/xfs_trans.c, line: 116
<machine registers snipped>
Call Trace:
xfs_trans_dup+0x211/0x250 [xfs]
xfs_trans_roll+0x6d/0x180 [xfs]
xfs_defer_trans_roll+0x10c/0x3b0 [xfs]
xfs_defer_finish_noroll+0xdf/0x740 [xfs]
xfs_defer_finish+0x13/0x70 [xfs]
xfs_reflink_end_cow+0x2c6/0x680 [xfs]
xfs_dio_write_end_io+0x115/0x220 [xfs]
iomap_dio_complete+0x3f/0x130
iomap_dio_rw+0x3c3/0x420
xfs_file_dio_aio_write+0x132/0x3c0 [xfs]
xfs_file_write_iter+0x8b/0xc0 [xfs]
__vfs_write+0x193/0x1f0
vfs_write+0xba/0x1c0
ksys_write+0x52/0xc0
do_syscall_64+0x50/0x160
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 2c307174 19-Nov-2018 Dave Chinner <dchinner@redhat.com>

xfs: flush removing page cache in xfs_reflink_remap_prep

On a sub-page block size filesystem, fsx is failing with a data
corruption after a series of operations involving copying a file
with the destination offset beyond the EOF of the destination file:

8093(157 mod 256): TRUNCATE DOWN from 0x7a120 to 0x50000 ******WWWW
8094(158 mod 256): INSERT 0x25000 thru 0x25fff (0x1000 bytes)
8095(159 mod 256): COPY 0x18000 thru 0x1afff (0x3000 bytes) to 0x2f400
8096(160 mod 256): WRITE 0x5da00 thru 0x651ff (0x7800 bytes) HOLE
8097(161 mod 256): COPY 0x2000 thru 0x5fff (0x4000 bytes) to 0x6fc00

The second copy here is beyond EOF, and it is to a sub-page (4k) but
block-aligned (1k) offset. The clone runs the EOF zeroing, landing
in a pre-existing post-eof delalloc extent. This zeroes the post-eof
extents in the page cache just fine, dirtying the pages correctly.

The problem is that xfs_reflink_remap_prep() now truncates the page
cache over the range that it is copying it to, and rounds that down
to cover the entire start page. This removes the dirty page over the
delalloc extent from the page cache without having written it back.
Hence later, when the page cache is flushed, the page at offset
0x6f000 has not been written back and hence exposes stale data,
which fsx trips over less than 10 operations later.

Fix this by changing xfs_reflink_remap_prep() to use
xfs_flush_unmap_range().
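
The offsets in the fsx trace make the rounding problem easy to verify by hand. The sketch below uses two hypothetical helpers, `round_down_pow2` and `round_up_pow2` (not kernel macros), to show that the copy destination 0x6fc00 is aligned to a 1k block but not to a 4k page, and that rounding it down lands on the page at 0x6f000 named above.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align, which must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Round x up to a multiple of align, which must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return round_down_pow2(x + align - 1, align);
}
```

For the second COPY (0x4000 bytes to 0x6fc00), the page-aligned range works out to 0x6f000 through 0x74000, so the first page of that range also covers the 0xc00 bytes that precede the copy destination.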

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 59e42931 14-Nov-2018 Brian Foster <bfoster@redhat.com>

xfs: fix shared extent data corruption due to missing cow reservation

Page writeback indirectly handles shared extents via the existence
of overlapping COW fork blocks. If COW fork blocks exist, writeback
always performs the associated copy-on-write regardless of whether
the underlying blocks are actually shared. If the blocks are shared,
then overlapping COW fork blocks must always exist.

fstests shared/010 reproduces a case where a buffered write occurs
over a shared block without performing the requisite COW fork
reservation. This ultimately causes writeback to the shared extent
and data corruption that is detected across md5 checks of the
filesystem across a mount cycle.

The problem occurs when a buffered write lands over a shared extent
that crosses an extent size hint boundary and that also happens to
have a partial COW reservation that doesn't cover the start and end
blocks of the data fork extent.

For example, a buffered write occurs across the file offset (in FSB
units) range of [29, 57]. A shared extent exists at blocks [29, 35]
and COW reservation already exists at blocks [32, 34]. After
accommodating a COW extent size hint of 32 blocks and the existing
reservation at offset 32, xfs_reflink_reserve_cow() allocates 32
blocks of reservation at offset 0 and returns with COW reservation
across the range of [0, 34]. The associated data fork extent is
still [29, 35], however, which isn't fully covered by the COW
reservation.

This leads to a buffered write at file offset 35 over a shared
extent without associated COW reservation. Writeback eventually
kicks in, performs an overwrite of the underlying shared block and
causes the associated data corruption.

Update xfs_reflink_reserve_cow() to accommodate the fact that a
delalloc allocation request may not fully cover the extent in the
data fork. Trim the data fork extent appropriately, just as is done
for shared extent boundaries and/or existing COW reservations that
happen to overlap the start of the data fork extent. This prevents
shared/010 failures due to data corruption on reflink enabled
filesystems.
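
The trimming step can be illustrated with the numbers from the example above. This is a sketch under stated assumptions: `struct mapping` and `trim_to_reservation` are invented for illustration, ranges are held as a start block plus a count, and the reservation is assumed to cover the start of the data fork extent.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mapping covering file blocks [start, start + count). */
struct mapping {
	uint64_t start;
	uint64_t count;
};

/*
 * Trim *data so that it does not extend past the end of the COW
 * reservation.  Assumes the reservation covers the start of *data.
 */
static void trim_to_reservation(struct mapping *data,
				const struct mapping *cow)
{
	uint64_t data_end = data->start + data->count;
	uint64_t cow_end = cow->start + cow->count;

	assert(cow->start <= data->start && data->start < cow_end);

	if (data_end > cow_end)
		data->count = cow_end - data->start;
}
```

With the shared extent at blocks [29, 35] (start 29, count 7) and a reservation over [0, 34] (start 0, count 35), the trimmed extent becomes [29, 34], so block 35 is no longer treated as covered.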

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# bf4a1fcf 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove [cm]time update from reflink calls

Now that the vfs remap helper dirties the inode [cm]time for us, xfs no
longer needs to do that on its own.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 3fc9f5e4 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove xfs_reflink_remap_range

Since xfs_file_remap_range is a thin wrapper, move the contents of
xfs_reflink_remap_range into the shell. This cuts down on the vfs
calls being made from internal xfs code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 7a6ccf00 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove redundant remap partial EOF block checks

Now that we've moved the partial EOF block checks to the VFS helpers, we
can remove the redundant functionality from XFS.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 3f68c1f5 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: support returning partial reflink results

Back when the XFS reflink code only supported clone_file_range, we were
only able to return zero or negative error codes to userspace. However,
now that copy_file_range (which returns bytes copied) can use XFS'
clone_file_range, we have the opportunity to return partial results.
For example, if userspace sends a 1GB clone request and we run out of
space halfway through, we at least can tell userspace that we completed
512M of that request like a regular write.
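
The short-return behavior described above can be sketched as a toy
user-space model in C. This is not the XFS implementation: the function
name remap_range_sketch, the fixed chunk size, and the "avail" counter
standing in for free space are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model: remap `len` bytes in `chunk`-sized steps while
 * `avail` bytes of space remain. Returns the number of bytes completed,
 * or -ENOSPC only if nothing at all could be remapped.
 */
static int64_t remap_range_sketch(uint64_t len, uint64_t chunk, uint64_t avail)
{
	uint64_t done = 0;

	assert(chunk > 0);
	while (done < len) {
		uint64_t step = len - done < chunk ? len - done : chunk;

		if (step > avail)
			break;		/* out of space: stop and report progress */
		avail -= step;
		done += step;
	}
	if (done == 0 && len > 0)
		return -ENOSPC;		/* no progress, so report the error */
	return (int64_t)done;		/* possibly short, like a regular write */
}
```

In this model a 1GB request with only 512M of space available returns
512M instead of an error, which mirrors the example in the text.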

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 9f04aaff 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: clean up xfs_reflink_remap_blocks call site

Move the offset <-> blocks unit conversions into
xfs_reflink_remap_blocks to make the call site less ugly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 4918ef4e 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix pagecache truncation prior to reflink

Prior to remapping blocks, it is necessary to remove pages from the
destination file's page cache. Unfortunately, the truncation is not
aggressive enough -- if page size > block size, we'll end up zeroing
subpage blocks instead of removing them. So, round the start offset
down and the end offset up to page boundaries. We already wrote all
the dirty data so the larger range shouldn't be a problem.
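
As an illustration of the rounding described above (not the kernel code
itself), here is a minimal user-space sketch in C. The names
round_down_pow2, round_up_pow2, struct byte_span and truncate_span are
hypothetical, and the page size is passed in as a parameter instead of
coming from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; `align` must be a nonzero power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Byte range to drop from the page cache; `end` is exclusive. */
struct byte_span {
	uint64_t start;
	uint64_t end;
};

/* Widen [pos, pos + len) outward to whole pages. */
static struct byte_span truncate_span(uint64_t pos, uint64_t len,
				      uint64_t page_size)
{
	struct byte_span s;

	s.start = round_down_pow2(pos, page_size);
	s.end = round_up_pow2(pos + len, page_size);
	return s;
}
```

With 64k pages and 4k blocks, a request covering bytes 4096 to 12287
widens to the whole first page, bytes 0 to 65535, so no subpage block is
left behind to be zeroed.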

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 8c5c836b 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: clean up generic_remap_file_range_prep return value

Since the remap prep function can update the length of the remap
request, we can change this function to return the usual return status
instead of the odd behavior it has now.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 42ec3d4c 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: make remap_file_range functions take and return bytes completed

Change the remap_file_range functions to take a number of bytes to
operate upon and return the number of bytes they operated on. This is a
requirement for allowing fs implementations to return short clone/dedupe
results to the user, which will enable us to obey resource limits in a
graceful manner.

A subsequent patch will enable copy_file_range to signal to the
->clone_file_range implementation that it can handle a short length,
which will be returned in the function's return value. For now the
short return is not implemented anywhere so the behavior won't change --
either copy_file_range manages to clone the entire range or it tries an
alternative.

Neither clone ioctl can take advantage of this, alas.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 8dde90bc 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: remap helper should update destination inode metadata

Extend generic_remap_file_range_prep to handle inode metadata updates
when remapping into a file. If the operation can possibly alter the
file contents, we must update the ctime and mtime and remove security
privileges, just like we do for regular file writes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# a91ae49b 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: pass remap flags to generic_remap_file_range_prep

Plumb the remap flags through the filesystem from the vfs function
dispatcher all the way to the prep function to prepare for behavior
changes in subsequent patches.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# a83ab01a 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: rename vfs_clone_file_prep to be more descriptive

The vfs_clone_file_prep is a generic function to be called by filesystem
implementations only. Rename the prefix to generic_ and make it more
clear that it applies to remap operations, not just clones.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 1383a7ed 29-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

vfs: check file ranges before cloning files

Move the file range checks from vfs_clone_file_prep into a separate
generic_remap_checks function so that all the checks are collected in a
central location. This forms the basis for adding more checks from
generic_write_checks that will make cloning's input checking more
consistent with write input checking.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 032dc923 18-Oct-2018 Christoph Hellwig <hch@lst.de>

xfs: fix fork selection in xfs_find_trim_cow_extent

We want to write directly into the data fork for blocks that don't
have an extent in the COW fork covering them yet.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# d392bc81 18-Oct-2018 Christoph Hellwig <hch@lst.de>

xfs: remove the unused trimmed argument from xfs_reflink_trim_around_shared

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# fc439464 18-Oct-2018 Christoph Hellwig <hch@lst.de>

xfs: remove the unused shared argument to xfs_reflink_reserve_cow

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# b3998900 05-Oct-2018 Dave Chinner <david@fromorbit.com>

xfs: fix data corruption w/ unaligned reflink ranges

When reflinking sub-file ranges, a data corruption can occur when
the source file range includes a partial EOF block. This shares the
unknown data beyond EOF into the second file at a position inside
EOF, exposing stale data in the second file.

XFS only supports whole block sharing, but we still need to
support whole file reflink correctly. Hence if the reflink
request includes the last block of the source file, only proceed with
the reflink operation if it lands at or past the destination file's
current EOF. If it lands within the destination file EOF, reject the
entire request with -EINVAL and make the caller go the hard way.

This avoids the data corruption vector, but also avoids disruption
of returning EINVAL to userspace for the common case of whole file
cloning.
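
The rule above can be sketched as a small stand-alone check in C. This is
an illustration under assumptions, not the code from this patch: the
function name reflink_eof_check and its parameters are hypothetical,
"lands at or past EOF" is read as the end of the destination range, and
the rejection of unaligned ranges that do not end at the source EOF is
inferred from "XFS only supports whole block sharing".

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical check: return 0 if the reflink may proceed, -EINVAL if
 * the caller must fall back to a regular copy. Start offsets are
 * assumed to be block aligned already.
 */
static int reflink_eof_check(uint64_t src_off, uint64_t len,
			     uint64_t src_size, uint64_t dst_off,
			     uint64_t dst_size, uint64_t blksz)
{
	uint64_t src_end = src_off + len;

	/* A range ending on a block boundary shares only whole blocks. */
	if (src_end % blksz == 0)
		return 0;

	/* Assumption: an unaligned end is tolerable only at source EOF. */
	if (src_end != src_size)
		return -EINVAL;

	/*
	 * Partial EOF block: sharing it inside the destination's EOF
	 * would expose the stale bytes that follow the source EOF.
	 */
	if (dst_off + len < dst_size)
		return -EINVAL;

	return 0;
}
```

A whole-file clone of a 5000 byte file into an empty file passes this
check, while the same range aimed at the start of a 100000 byte
destination is rejected.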

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# dceeb47b 05-Oct-2018 Dave Chinner <dchinner@redhat.com>

xfs: fix data corruption w/ unaligned dedupe ranges

A deduplication data corruption is exposed by fstests generic/505 on
XFS. It is caused by extending the block match range to include the
partial EOF block, but then allowing unknown data beyond EOF to be
considered a "match" to data in the destination file because the
comparison is only made to the end of the source file. This corrupts
the destination file when the source extent is shared with it.

XFS only supports whole block dedupe, but we still need to appear to
support whole file dedupe correctly. Hence if the dedupe request
includes the last block of the source file, don't include it in the
actual XFS dedupe operation. If the rest of the range dedupes
successfully, then report the partial last block as deduped, too, so
that userspace sees it as a successful dedupe rather than return
EINVAL because we can't dedupe unaligned blocks.
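
A minimal sketch of the trimming step described above, in user-space C.
The helper name dedupe_compare_len and its parameters are hypothetical
and the real XFS code is structured differently; the sketch only shows
how the partial EOF block is left out of the range that is actually
deduped.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given a dedupe request on the source file,
 * return how many bytes should really be compared and remapped. If the
 * request ends at an unaligned source EOF, the partial last block is
 * dropped from the operation.
 */
static uint64_t dedupe_compare_len(uint64_t src_off, uint64_t len,
				   uint64_t src_size, uint64_t blksz)
{
	uint64_t src_end = src_off + len;
	uint64_t tail = src_end % blksz;	/* bytes in the partial block */

	if (src_end != src_size || tail == 0)
		return len;		/* no partial EOF block involved */
	if (tail >= len)
		return 0;		/* request lies within the partial block */
	return len - tail;
}
```

If the trimmed range dedupes successfully, the caller would still report
the original len to userspace, as the text explains.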

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 7debbf01 05-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: update ctime and remove suid before cloning files

Before cloning into a file, update the ctime and remove sensitive
attributes like suid, just like we'd do for a regular file write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 410fdc72 05-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: zero posteof blocks when cloning above eof

When we're reflinking between two files and the destination file range
is well beyond the destination file's EOF marker, zero any posteof
speculative preallocations in the destination file so that we don't
expose stale disk contents. The previous strategy of trying to clear
the preallocations does not work if the destination file has the
PREALLOC flag set.

Uncovered by shared/010.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 0d41e1d2 05-Oct-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: refactor clonerange preparation into a separate helper

Refactor all the reflink preparation steps into a separate helper
that we'll use to land all the upcoming fixes for insufficient input
checks.

This rework also moves the invalidation of the destination range to
the prep function so that it is done before the range is remapped.
This ensures that nobody can access the data in range being remapped
until the remap is complete.

[dgc: fix xfs_reflink_remap_prep() return value and caller check to
handle vfs_clone_file_prep_inodes() returning 0 to mean "nothing to
do". ]

[dgc: make sure length changed by vfs_clone_file_prep_inodes() gets
propagated back to XFS code that does the remapping. ]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# f5f3f959 28-Sep-2018 Christoph Hellwig <hch@lst.de>

xfs: skip delalloc COW blocks in xfs_reflink_end_cow

The iomap direct I/O code issues a single ->end_io call for the whole
I/O request, and if some of the extents covered needed a COW operation
it will call xfs_reflink_end_cow over the whole range.

When we do AIO writes we drop the iolock after doing the initial setup,
but before the I/O completion. Between dropping the lock and completing
the I/O we can have a racing buffered write create new delalloc COW fork
extents in the region covered by the outstanding direct I/O write, and
thus see delalloc COW fork extents in xfs_reflink_end_cow. As
concurrent writes are fundamentally racy and no guarantees are given we
can simply skip those.

This can be easily reproduced with xfstests generic/208 in always_cow
mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# df307077 28-Sep-2018 Dave Chinner <dchinner@redhat.com>

xfs: fix transaction leak in xfs_reflink_allocate_cow()

When xfs_reflink_allocate_cow() allocates a transaction, it drops
the ILOCK to perform the operation. This introduces a race condition
where another thread modifying the file can perform the COW
allocation operation underneath us. This results in the retry loop
finding an allocated block and jumping straight to the conversion
code. It does not, however, cancel the transaction it holds and so
this gets leaked. This results in a lockdep warning:

================================================
WARNING: lock held when returning to user space!
4.18.5 #1 Not tainted
------------------------------------------------
worker/6123 is leaving the kernel with locks still held!
1 lock held by worker/6123:
#0: 000000009eab4f1b (sb_internal#2){.+.+}, at: xfs_trans_alloc+0x17c/0x220

And eventually the filesystem deadlocks because it runs out of log
space that is reserved by the leaked transaction and never gets
released.

The logic flow in xfs_reflink_allocate_cow() is a convoluted mess of
gotos - it's no surprise that it has a bug where the flow through
several goto jumps then fails to clean up context from a non-obvious
logic path. Clean up the logic flow and make sure every path does
the right thing.

Reported-by: Alexander Y. Fomichev <git.user@gmail.com>
Tested-by: Alexander Y. Fomichev <git.user@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200981
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: slight refactor]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 9d9e6233 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: fold dfops into the transaction

struct xfs_defer_ops has now been reduced to a single list_head. The
external dfops mechanism is unused and thus everywhere a (permanent)
transaction is accessible the associated dfops structure is as well.

Remove the xfs_defer_ops structure and fold the list_head into the
transaction. Also remove the last remnant of external dfops in
xfs_trans_dup().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0f37d178 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: pass transaction to xfs_defer_add()

The majority of remaining references to struct xfs_defer_ops in XFS
are associated with xfs_defer_add(). At this point, there are no
more external xfs_defer_ops users left. All instances of
xfs_defer_ops are embedded in the transaction, which means we can
safely pass the transaction down to the dfops add interface.

Update xfs_defer_add() to receive the transaction as a parameter.
Various subsystems implement wrappers to allocate and construct the
context specific data structures for the associated deferred
operation type. Update these to also carry the transaction down as
needed and clean up unused dfops parameters along the way.

This removes most of the remaining references to struct
xfs_defer_ops throughout the code and facilitates removal of the
structure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix unused variable warnings with ftrace disabled]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 9b1f4e98 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: cancel dfops on xfs_defer_finish() error

The current semantics of xfs_defer_finish() require the caller to
call xfs_defer_cancel() on error. This is slightly inconsistent with
transaction commit error handling where a failed commit cleans up
the transaction before returning.

More significantly, the only requirement for exposure of
->dop_pending outside of xfs_defer_finish() is so that
xfs_defer_cancel() can drain it on error. Since the only recourse of
xfs_defer_finish() errors is cancellation, mirror the transaction
logic and cancel remaining dfops before returning from
xfs_defer_finish() with an error.

Beside simplifying xfs_defer_finish() semantics, this ensures that
xfs_defer_finish() always returns with an empty ->dop_pending and
thus facilitates removal of the list from xfs_defer_ops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# a8198666 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: automatic dfops inode relogging

Inodes that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While inodes are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for inodes with ili_lock_flags == 0.

Replace the xfs_defer_ijoin() infrastructure with such detection and
automatic relogging of held inodes. This eliminates the need for the
per-dfops inode list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 488c919a 01-Aug-2018 Brian Foster <bfoster@redhat.com>

xfs: add missing defer ijoins for held inodes

Log items that require relogging during deferred operations
processing are explicitly joined to the associated dfops via the
xfs_defer_*join() helpers. These calls imply that the associated
object is "held" by the transaction such that when rolled, the item
can be immediately joined to a follow up transaction. For buffers,
this means the buffer remains locked and held after each roll. For
inodes, this means that the inode remains locked.

Failure to join a held item to the dfops structure means the
associated object pins the tail of the log while dfops processing
completes, because the item never relogs and is not unlocked or
released until deferred processing completes.

Currently, all buffers that are held in transactions (XFS_BLI_HOLD)
with deferred operations are explicitly joined to the dfops. This is
not the case for inodes, however, as various contexts defer
operations to transactions with held inodes without explicit joins
to the associated dfops (and thus not relogging).

While this is not a catastrophic problem, it is not ideal. Given
that we want to eventually relog such items automatically during
dfops processing, start by explicitly adding these missing
xfs_defer_ijoin() calls. A call is added everywhere an inode is
joined to a transaction without transferring lock ownership and
said transaction runs deferred operations.

All xfs_defer_ijoin() calls will eventually be replaced by automatic
dfops inode relogging. This patch essentially implements the
behavior change that would otherwise occur due to automatic inode
dfops relogging.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 51d62690 17-Jul-2018 Christoph Hellwig <hch@lst.de>

xfs: introduce a new xfs_inode_has_cow_data helper

We have a few places that already check if an inode has actual data in
the COW fork to avoid work on reflink inodes that do not actually have
outstanding COW blocks. There are a few more places that can avoid
work by doing the same check, so add a documented helper for this
condition and use it in all places where it makes sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 9e28a242 24-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: drop unnecessary xfs_defer_finish() dfops parameter

Every caller of xfs_defer_finish() now passes the transaction and
its associated ->t_dfops. The xfs_defer_ops parameter is therefore
no longer necessary and can be removed.

Since most xfs_defer_finish() callers also have to consider
xfs_defer_cancel() on error, update the latter to also receive the
transaction for consistency. The log recovery code contains an
outlier case that cancels a dfops directly without an available
transaction. Retain an internal wrapper to support this outlier case
for the time being.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# c8eac49e 24-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: remove all boilerplate defer init/finish code

At this point, the transaction subsystem completely manages deferred
items internally such that the common and boilerplate
xfs_trans_alloc() -> xfs_defer_init() -> xfs_defer_finish() ->
xfs_trans_commit() sequence can be replaced with a simple
transaction allocation and commit.

Remove all such boilerplate deferred ops code. In doing so, we
change each case over to use the dfops in the transaction and
specifically eliminate:

- The on-stack dfops and associated xfs_defer_init() call, as the
internal dfops is initialized on transaction allocation.
- xfs_bmap_finish() calls that precede a final xfs_trans_commit() of
a transaction.
- xfs_defer_cancel() calls in error handlers that precede a
transaction cancel.

The only deferred ops calls that remain are those that are
non-deterministic with respect to the final commit of the associated
transaction or are open-coded due to special handling.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 1e5ae199 24-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: use internal dfops in cow blocks cancel

All callers either explicitly initialize a dfops or pass a
transaction with an internal dfops. Drop the hacky old dfops
replacement logic and use the one associated with the transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0b04b6b8 19-Jul-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: trivial xfs_btree_del_cursor cleanups

The error argument to xfs_btree_del_cursor already understands the
"nonzero for error" semantics, so remove pointless error testing in the
callers and pass it directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 5fdd9794 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: remove xfs_defer_init() firstblock param

All but one caller of xfs_defer_init() passes in the ->t_firstblock
of the associated transaction. The one outlier is
xlog_recover_process_intents(), which simply passes a dummy value
because a valid pointer is required. This firstblock variable can
simply be removed.

At this point we could remove the xfs_defer_init() firstblock
parameter and initialize ->t_firstblock directly. Even that is not
necessary, however, because ->t_firstblock is automatically
reinitialized in the new transaction on a transaction roll. Since
xfs_defer_init() should never occur more than once on a particular
transaction (since the corresponding finish will roll it), replace
the reinit from xfs_defer_init() with an assert that verifies the
transaction has a NULLFSBLOCK firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 381d5928 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: use ->t_firstblock in reflink cow block cancel

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 2af52842 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: remove xfs_bunmapi() firstblock param

All callers pass ->t_firstblock from the current transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# a7beabea 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: remove xfs_bmapi_write() firstblock param

All callers pass ->t_firstblock from the current transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 37283797 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: use ->t_firstblock for all xfs_bunmapi() callers

Convert all xfs_bunmapi() callers to ->t_firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 650919f1 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: use ->t_firstblock for all xfs_bmapi_write() callers

Convert all xfs_bmapi_write() users to ->t_firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3ae2d891 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: allow null firstblock in xfs_bmapi_write() when tp is null

xfs_bmapi_write() always expects a valid firstblock pointer. It
immediately dereferences the pointer to help determine how to
initialize the bma.minleft field. The remaining accesses are
related to modifying btree format forks, which is only relevant for
!COW fork callers.

The reflink code passes a NULL transaction to xfs_bmapi_write() in a
couple places that do COW fork unwritten conversion. The purpose of
the firstblock field is to track the first block allocation in the
current transaction, so technically firstblock should not be
required for these callers either.

Tweak xfs_bmapi_write() to initialize the bma correctly without
accessing the firstblock pointer if no transaction is provided in
the first place. Update the reflink callers to pass NULL instead of
otherwise unused firstblock references.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# bcd2c9f3 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: refactor dfops init to attach to transaction

Most callers of xfs_defer_init() immediately attach the dfops
structure to a transaction. Add a transaction parameter to eliminate
much of this boilerplate code. This also helps self-document the
fact that many codepaths now expect a dfops pointer implicitly via
xfs_trans->t_dfops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 27356a06 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: use ->t_dfops in cancel cow blocks operation

Use ->t_dfops of the transaction from the caller. Reset it before we
return to avoid leaks of local stack memory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# ed7ef8e5 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: remove unused btree cursor bc_private.a.dfops field

The xfs_btree_cur.bc_private.a.dfops field is only ever initialized
by the refcountbt cursor init function. The only caller of that
function with a non-NULL dfops is from deferred completion context,
which already has attached to ->t_dfops.

In addition to that, the only actual reference of a.dfops is the
cursor duplication function, which means the field is effectively
unused.

Remove the dfops field from the bc_private.a union. Any future users
can acquire the dfops from the transaction. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# ccd9d911 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: remove xfs_bunmapi() dfops param

Now that all xfs_bunmapi() callers use ->t_dfops, remove the
unnecessary parameter and access ->t_dfops directly. This patch does
not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 4bcfa613 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: use ->t_dfops for all xfs_bunmapi() callers

Use ->t_dfops for all remaining xfs_bunmapi() callers. This prepares
the latter to no longer require a dfops parameter.

Note that xfs_itruncate_extents_flags() associates a local dfops
with a transaction provided from the caller. Since there are
multiple callers, set and reset ->t_dfops before the function
returns to avoid exposure of stack memory to the caller.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 6e702a5d 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: remove xfs_bmapi_write() dfops param

Now that all callers use ->t_dfops, the xfs_bmapi_write() dfops
parameter is no longer necessary. Remove it and access ->t_dfops
directly. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 175d1a01 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: use ->t_dfops for all xfs_bmapi_write() callers

Attach ->t_dfops for all remaining callers of xfs_bmapi_write().
This prepares the latter to no longer require a separate dfops
parameter.

Note that xfs_symlink() already uses ->t_dfops. Fix up the local
references for consistency.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 8a749386 11-Jul-2018 Brian Foster <bfoster@redhat.com>

xfs: cow unwritten conversion uses uninitialized dfops

A couple COW fork unwritten extent conversion helpers pass an
uninitialized dfops pointer to xfs_bmapi_write(). This does not
cause problems because conversion does not use a transaction or the
dfops structure for the COW fork. Drop the uninitialized usage of
dfops in these codepaths and pass NULL along to xfs_bmapi_write()
instead.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 060d4eaa 11-Jul-2018 Christoph Hellwig <hch@lst.de>

xfs: remove xfs_reflink_find_cow_mapping

We only have one caller left, and open coding the simple extent list
lookup in it allows us to make the code both more understandable and
reuse calculations and variables already present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# fca8c805 11-Jul-2018 Christoph Hellwig <hch@lst.de>

xfs: remove xfs_reflink_trim_irec_to_next_cow

We already have to check for overlapping COW extents every time we
come back to a page in xfs_writepage_map / xfs_map_cow, so this
additional trim is not required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 0b61f8a4 05-Jun-2018 Dave Chinner <dchinner@redhat.com>

xfs: convert to SPDX license tags

Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/.

This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:

for f in `git grep -l "GNU General" fs/xfs/` ; do
    echo $f
    cat $f | awk -f hdr.awk > $f.new
    mv -f $f.new $f
done

And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:

$ cat hdr.awk
BEGIN {
    hdr = 1.0
    tag = "GPL-2.0"
    str = ""
}

/^ \* This program is free software/ {
    hdr = 2.0;
    next
}

/any later version./ {
    tag = "GPL-2.0+"
    next
}

/^ \*\// {
    if (hdr > 0.0) {
        print "// SPDX-License-Identifier: " tag
        print str
        print $0
        str=""
        hdr = 0.0
        next
    }
    print $0
    next
}

/^ \* / {
    if (hdr > 1.0)
        next
    if (hdr > 0.0) {
        if (str != "")
            str = str "\n"
        str = str $0
        next
    }
    print $0
    next
}

/^ \*/ {
    if (hdr > 0.0)
        next
    print $0
    next
}

// {
    if (hdr > 0.0) {
        if (str != "")
            str = str "\n"
        str = str $0
        next
    }
    print $0
}

END { }
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 4882c19d 04-May-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: split out dqget for inodes from regular dqget

There are two uses of dqget here -- one is to return the dquot for a
given type and id, and the other is to return the dquot for a given type
and inode. Those are two separate things, so split them into two
smaller functions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# c14cfcca 04-May-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove unnecessary xfs_qm_dqattach parameter

The flags argument is always zero, get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 844e5e74 09-May-2018 Dave Chinner <dchinner@redhat.com>

xfs: fix double ijoin in xfs_reflink_clear_inode_flag()

xfs_reflink_clear_inode_flag double-joins an inode to a transaction,
which is not allowed. Fix that and document that the caller must have
already joined it.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edit out trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# c5295c6a 09-May-2018 Dave Chinner <dchinner@redhat.com>

xfs: fix double ijoin in xfs_reflink_cancel_cow_range

xfs_reflink_cancel_cow_range joins an inode twice to the same
transaction. This is not allowed, so fix it and document that the
callers of xfs_reflink_cancel_cow_blocks() must have already joined the
inode to the permanent transaction passed in.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[darrick: edited the commit log to remove trace for nonexistent ASSERT]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# df79b81b 14-Mar-2018 Christoph Hellwig <hch@lst.de>

xfs: minor cleanup for xfs_reflink_end_cow

Use xfs_iext_prev_extent to skip to the previous extent instead of
opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# c7dbe3f2 14-Mar-2018 Christoph Hellwig <hch@lst.de>

xfs: assert that xfs_reflink_allocate_cow is called with XFS_ILOCK_EXCL

Now that we convert COW preallocations from unwritten to real on every
call this function needs to be called with the ilock held exclusively.

Fortunately we already do that, but update the assert to match.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 21592863 09-Mar-2018 Brian Foster <bfoster@redhat.com>

xfs: rename agfl perag res type to rmapbt

The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.

This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.

Note that this patch does not change perag reservation behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 4df0f7f1 06-Mar-2018 Dave Chinner <dchinner@redhat.com>

xfs: fix transaction allocation deadlock in IO path

xfs_trans_alloc() does GFP_KERNEL allocation, and we can call it
while holding pages locked for writeback in the ->writepages path.
The memory allocation is allowed to wait on pages under writeback,
and so can wait on pages that are tagged as writeback by the
caller.

This affects both pre-IO submission and post-IO submission paths.
Hence xfs_setsize_trans_alloc(), xfs_reflink_end_cow(),
xfs_iomap_write_unwritten() and xfs_reflink_cancel_cow_range().
xfs_iomap_write_unwritten() already does the right thing, but the
others don't. Fix them.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Fixes: 281627df3eb5 ("xfs: log file size updates at I/O completion time")
Fixes: 43caeb187deb9 ("xfs: move mappings from cow fork to data fork after copy-write")
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 9f37bd11 26-Jan-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: check reflink allocation mappings

There's a really bad bug in xfs_reflink_allocate_cow -- bmapi_write
can return a zero error code but no mappings. This happens if there's
an extent size hint (which causes allocation requests to be rounded to
extsz granularity internally), but there wasn't a big enough chunk of
free space to start filling at the extsz granularity and fill even one
block of the range that we actually requested.

In any case, if we got no mappings we can't possibly do anything useful
with the contents of imap, so we must bail out with ENOSPC here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
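
The check described in this commit can be sketched outside the kernel as
follows. This is only an illustrative model, not the actual patch: every
toy_* name, the struct layout, and the free_blocks parameter are
hypothetical stand-ins for xfs_bmapi_write and its arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a block mapping; not the kernel's record type. */
struct toy_mapping {
    unsigned long start_off;   /* first file block covered */
    unsigned long block_count; /* number of blocks covered */
};

/*
 * Hypothetical allocator. It returns 0 on success and stores the number
 * of mappings it filled in *nmaps. Like the case described above, it can
 * return 0 while leaving *nmaps at zero when no space could be found.
 */
static int toy_alloc(unsigned long off, unsigned long len,
                     unsigned long free_blocks,
                     struct toy_mapping *map, int *nmaps)
{
    assert(map != NULL && nmaps != NULL);
    *nmaps = 0;
    if (free_blocks == 0)
        return 0;               /* "success", but nothing was mapped */
    map->start_off = off;
    map->block_count = len < free_blocks ? len : free_blocks;
    *nmaps = 1;
    return 0;
}

/* Caller-side check: a zero return with zero mappings becomes -ENOSPC. */
static int toy_allocate_cow(unsigned long off, unsigned long len,
                            unsigned long free_blocks,
                            struct toy_mapping *map)
{
    int nmaps = 0;
    int error = toy_alloc(off, len, free_blocks, map, &nmaps);

    if (error)
        return error;
    if (nmaps == 0)
        return -ENOSPC;         /* map holds nothing useful; bail out */
    return 0;
}
```

The point of the sketch is that the caller must look at the mapping count
as well as the error code before touching the returned mapping.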


# 4b4c1326 19-Jan-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: treat CoW fork operations as delalloc for quota accounting

Since the CoW fork only exists in memory, it is incorrect to update the
on-disk quota block counts when we modify the CoW fork. Unlike the data
fork, even real extents in the CoW fork are only delalloc-style
reservations (on-disk they're owned by the refcountbt) so they must not
be tracked in the on disk quota info. Ensure the i_delayed_blks
accounting reflects this too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 01c2e13d 18-Jan-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: only grab shared inode locks for source file during reflink

Reflink and dedupe operations remap blocks from a source file into a
destination file. The destination file needs exclusive locks on all
levels because we're updating its block map, but the source file isn't
undergoing any block map changes so we can use a shared lock.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 7c2d238a 26-Jan-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: allow xfs_lock_two_inodes to take different EXCL/SHARED modes

Refactor xfs_lock_two_inodes to take separate locking modes for each
inode. Specifically, this enables us to take a SHARED lock on one inode
and an EXCL lock on the other. The lock class (MMAPLOCK/ILOCK) must be
the same for each inode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
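
As a rough illustration of the interface change described here, the
sketch below models two inodes being locked in a fixed order while each
gets its own mode. It is a toy model only: the toy_* names are
hypothetical, no real locking is performed, and ordering by ascending
inode number is an assumption of this sketch rather than something this
commit message states.

```c
#include <assert.h>

enum toy_mode { TOY_UNLOCKED = 0, TOY_SHARED, TOY_EXCL };

struct toy_inode {
    unsigned long ino;      /* inode number, used only for ordering */
    enum toy_mode held;     /* mode recorded by the toy "lock" */
};

/* Records acquisition order so the ordering rule can be observed. */
static unsigned long toy_lock_order[2];

/*
 * "Lock" two distinct inodes, each with its own mode. The pair is
 * swapped (modes included) so the lower inode number always goes first.
 */
static void toy_lock_two(struct toy_inode *a, enum toy_mode a_mode,
                         struct toy_inode *b, enum toy_mode b_mode)
{
    assert(a->ino != b->ino);
    assert(a_mode != TOY_UNLOCKED && b_mode != TOY_UNLOCKED);

    if (a->ino > b->ino) {
        struct toy_inode *tmp_inode = a;
        enum toy_mode tmp_mode = a_mode;

        a = b;
        a_mode = b_mode;
        b = tmp_inode;
        b_mode = tmp_mode;
    }

    a->held = a_mode;
    toy_lock_order[0] = a->ino;
    b->held = b_mode;
    toy_lock_order[1] = b->ino;
}
```

With this shape, a reflink-style caller could ask for TOY_SHARED on the
source and TOY_EXCL on the destination in a single call, and the mode
stays attached to the right inode even when the pair is reordered.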


# 1364b1d4 18-Jan-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: reflink should break pnfs leases before sharing blocks

Before we share blocks between files, we need to break the pnfs leases
on the layout before we start slicing and dicing the block map. The
structure of this function sets us up for the lock contention reduction
in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 09ac8623 19-Jan-2018 Darrick J. Wong <darrick.wong@oracle.com>

xfs: call xfs_qm_dqattach before performing reflink operations

Ensure that we've attached all the necessary dquots before performing
reflink operations so that quota accounting is accurate.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 86d692bf 14-Dec-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: set cowblocks tag for direct cow writes too

If a user performs a direct CoW write, we end up loading the CoW fork
with preallocated extents. Therefore, we must set the cowblocks tag so
that they can be cleared out if we run low on space.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# a192de26 10-Dec-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: allow CoW remap transactions to use reserve blocks

Since we as yet have no way of holding on to the indlen blocks that are
reserved as part of CoW fork delalloc reservations, let the CoW remap
transaction dip into the reserves so that we avoid failing writes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 9d40fba8 10-Dec-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: avoid infinite loop when cancelling CoW blocks after writeback failure

When we're cancelling a cow range, we don't always delete each extent
that we iterate, so we have to move icur backwards in the list to avoid
an infinite loop.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 73353f48 10-Dec-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: relax is_reflink_inode assert in xfs_reflink_find_cow_mapping

We don't hold the ilock through the entire sequence of xfs_writepage_map
-> xfs_map_cow -> xfs_reflink_find_cow_mapping. This means that we can
race with another thread that is trying to clear the inode reflink flag,
with the result that the flag is set for the xfs_map_cow check but
cleared before we get to the assert in find_cow_mapping. When this
happens, we blow the assert even though everything is fine.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 5c989a0e 10-Dec-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove dest file's post-eof preallocations before reflinking

If we try to reflink into a file with post-eof preallocations at an
offset well past the preallocations, we increase i_size as one would
expect. However, those allocations do not have page cache backing them,
so they won't get cleaned out on their own. This leads to asserts in
the collapse/insert range code and xfs_destroy_inode when they encounter
delalloc extents they weren't expecting to find.

Since there are plenty of other places where we dump those post-eof
blocks, do the same to the reflink destination file before we start
remapping extents. This was found by adding clonerange support to
fsstress and running it in write-only mode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# eaf0ec30 06-Dec-2017 Pravin Shedge <pravin.shedge4linux@gmail.com>

fs: xfs: remove duplicate includes

These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.

Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# b121459c 03-Nov-2017 Christoph Hellwig <hch@lst.de>

xfs: simplify xfs_reflink_convert_cow

Instead of looking up extents to convert and calling xfs_bmapi_write on
each of them just let xfs_bmapi_write handle the full range. To make
this robust add a new XFS_BMAPI_CONVERT_ONLY that only converts ranges
and never allocates blocks.

[darrick: shorten the stringified CONVERT_ONLY trace flag]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 41caabd0 03-Nov-2017 Christoph Hellwig <hch@lst.de>

xfs: iterate backwards in xfs_reflink_cancel_cow_blocks

Match the iteration order for extent deletion in the truncate and
reflink I/O completion path.

This also happens to make implementing the new incore extent list
a lot easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# b2b1712a 03-Nov-2017 Christoph Hellwig <hch@lst.de>

xfs: introduce the xfs_iext_cursor abstraction

Add a new xfs_iext_cursor structure to hide the direct extent map
index manipulations. In addition to the existing lookup/get/insert/
remove and update routines new primitives to get the first and last
extent cursor, as well as moving up and down by one extent are
provided. Also new are convenience helpers to increment/decrement the
cursor and retrieve the new extent, as well as to peek into the
previous/next extent without updating the cursor and last but not
least a macro to iterate over all extents in a fork.

[darrick: rename for_each_iext to for_each_xfs_iext]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
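
To make the list of primitives above more concrete, here is a minimal
sketch of a cursor over a flat array of extents. It is not the kernel's
xfs_iext_cursor: the toy_* names, the array-backed fork, and the signed
index are all assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_extent { unsigned long off, len; };
struct toy_fork   { const struct toy_extent *recs; ptrdiff_t count; };
struct toy_cursor { ptrdiff_t pos; };   /* valid when 0 <= pos < count */

static void toy_first(const struct toy_fork *f, struct toy_cursor *c)
{
    (void)f;
    c->pos = 0;
}

static void toy_last(const struct toy_fork *f, struct toy_cursor *c)
{
    c->pos = f->count - 1;      /* -1 (invalid) for an empty fork */
}

/* Fetch the extent under the cursor; false if the cursor is off the list. */
static bool toy_get(const struct toy_fork *f, const struct toy_cursor *c,
                    struct toy_extent *out)
{
    assert(out != NULL);
    if (c->pos < 0 || c->pos >= f->count)
        return false;
    *out = f->recs[c->pos];
    return true;
}

/* Move one extent forward and fetch it. */
static bool toy_next_extent(const struct toy_fork *f, struct toy_cursor *c,
                            struct toy_extent *out)
{
    if (c->pos < f->count)
        c->pos++;
    return toy_get(f, c, out);
}

/* Move one extent backward and fetch it. */
static bool toy_prev_extent(const struct toy_fork *f, struct toy_cursor *c,
                            struct toy_extent *out)
{
    if (c->pos >= 0)
        c->pos--;
    return toy_get(f, c, out);
}

/* Look at the previous extent without moving the cursor. */
static bool toy_peek_prev(const struct toy_fork *f,
                          const struct toy_cursor *c,
                          struct toy_extent *out)
{
    struct toy_cursor tmp = { c->pos - 1 };

    return toy_get(f, &tmp, out);
}

/* Iterate over every extent in the fork. */
#define toy_for_each_extent(f, c, e) \
    for (toy_first((f), (c)); toy_get((f), (c), (e)); (c)->pos++)
```

Callers in this model never index the array themselves; they only hold a
cursor, which is the property that lets the backing structure change
later without touching them.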


# dc56015f 23-Oct-2017 Christoph Hellwig <hch@lst.de>

xfs: add a new xfs_iext_lookup_extent_before helper

This helper looks up the last extent that covers space before the passed
in block number. This is useful for truncate and similar operations that
operate backwards over the extent list. For xfs_bunmapi it also is
a slight optimization as we can return early if there are no extents
at or below the end of the to be truncated range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
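
The lookup described above can be approximated with a binary search over
a sorted extent array. This is a hedged sketch rather than the kernel
helper: toy_lookup_before, the struct, and the index-based return value
are hypothetical, and the real helper works through an extent cursor.

```c
#include <assert.h>
#include <stddef.h>

struct toy_extent { unsigned long off, len; };

/*
 * Return the index of the last extent that starts before block 'end',
 * or -1 if no extent starts below 'end'. 'recs' must be sorted by 'off'.
 */
static ptrdiff_t toy_lookup_before(const struct toy_extent *recs,
                                   ptrdiff_t count, unsigned long end)
{
    ptrdiff_t lo = 0, hi = count;   /* search window is [lo, hi) */

    assert(count >= 0);
    while (lo < hi) {
        ptrdiff_t mid = lo + (hi - lo) / 2;

        if (recs[mid].off < end)
            lo = mid + 1;           /* mid starts before end; look right */
        else
            hi = mid;               /* mid starts at or after end */
    }
    return lo - 1;                  /* lo counts extents starting below end */
}
```

A return value of -1 corresponds to the early-return case mentioned in
the commit message: nothing lies below the end of the range, so a
backwards walk has no work to do.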


# e12199f8 03-Oct-2017 Christoph Hellwig <hch@lst.de>

xfs: handle racy AIO in xfs_reflink_end_cow

If we got two AIO writes into a COW area the second one might not have any
COW extents left to convert. Handle that case gracefully instead of
triggering an assert or accessing beyond the bounds of the extent list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 8ad7c629 28-Aug-2017 Christoph Hellwig <hch@lst.de>

xfs: remove the ip argument to xfs_defer_finish

And instead require callers to explicitly join the inode using
xfs_defer_ijoin. Also consolidate the defer error handling in
a few places using a goto label.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 10479e2d 17-Jul-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: check _alloc_read_agf buffer pointer before using

In some circumstances, _alloc_read_agf can return an error code of zero
but also a null AGF buffer pointer. Check for this and jump out.

Fixes-coverity-id: 1415250
Fixes-coverity-id: 1415320
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 4c1a67bd 17-Jul-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write

We must initialize the firstfsb parameter to _bmapi_write so that it
doesn't incorrectly treat stack garbage as a restriction on which AGs
it can search for free space.

Fixes-coverity-id: 1402025
Fixes-coverity-id: 1415167
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# ea7cdd7b 16-Jun-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: separate function to check if inode shares extents

Separate the "clear reflink flag" function into one function that checks
if the flag is needed, and a second function that checks and clears the
flag. The inode scrub code will want to check the necessity of the flag
without clearing it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# 92ff7285 16-Jun-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: reflink find shared should take a transaction

Adapt _reflink_find_shared to take an optional transaction pointer. The
inode scrubber code will need to decide (within transaction context) if
a file has shared blocks. To avoid buffer deadlocks, we must pass the
tp through to this function's utility calls.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>


# fe0be23e 12-Apr-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: reserve enough blocks to handle btree splits when remapping

In xfs_reflink_end_cow, we erroneously reserve only enough blocks to
handle adding 1 extent. This is problematic if we fragment free space,
have to do CoW, and then have to perform multiple bmap btree expansions.
Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to
handle btree splits, so log recovery fails after our first error causes
the filesystem to go down.

Therefore, refactor the transaction block reservation macros until we
have a macro that works for our deferred (re)mapping activities, and fix
both problems by using that macro.

With 1k blocks we can hit this fairly often in g/187 if the scratch fs
is big enough.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 9c4f29d3 28-Mar-2017 Christoph Hellwig <hch@lst.de>

xfs: factor out a xfs_bmap_is_real_extent helper

This checks for all the non-normal extent types, including handling both
encodings of delayed allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3802a345 07-Mar-2017 Christoph Hellwig <hch@lst.de>

xfs: only reclaim unwritten COW extents periodically

We only want to reclaim preallocations from our periodic work item.
Currently this is achieved by looking for a dirty inode, but that check
is rather fragile. Instead add a flag to xfs_reflink_cancel_cow_* so
that the caller can ask for just cancelling unwritten extents in the COW
fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 93aaead5 13-Feb-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix uninitialized variable in _reflink_convert_cow

Fix an uninitialized variable.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# c5ecb423 06-Feb-2017 Christoph Hellwig <hch@lst.de>

xfs: update ctime and mtime on clone destination inodes

We're changing both metadata and data, so we need to update the
timestamps for clone operations. Dedupe on the other hand does
not change file data, and only changes invisible metadata so the
timestamps should not be updated.

This follows existing btrfs behavior.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: remove redundant is_dedupe test]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 3c68d44a 06-Feb-2017 Christoph Hellwig <hch@lst.de>

xfs: allocate direct I/O COW blocks in iomap_begin

Instead of preallocating all the required COW blocks in the high-level
write code do it inside the iomap code, like we do for all other I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# a14234c7 06-Feb-2017 Christoph Hellwig <hch@lst.de>

xfs: go straight to real allocations for direct I/O COW writes

When we allocate COW fork blocks for direct I/O writes we currently first
create a delayed allocation, and then convert it to a real allocation
once we've got the delayed one.

As there is no good reason for that this patch instead makes us call
xfs_bmapi_write from the COW allocation path. The only interesting bits
are a few tweaks to the low-level allocator to allow for this, most notably
the need to remove the call to xfs_bmap_extsize_align for the cowextsize
in xfs_bmap_btalloc - for the existing convert case it's a no-op, but
for the direct allocation case it would blow up our block reservation
way beyond what we reserved for the transaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# dcf9585a 06-Feb-2017 Christoph Hellwig <hch@lst.de>

xfs: return the converted extent in __xfs_reflink_convert_cow

We'll need it for the direct I/O code. Also rename the function to
xfs_reflink_convert_cow_extent to describe it a bit better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 5eda4300 02-Feb-2017 Darrick J. Wong <darrick.wong@oracle.com>

xfs: mark speculative prealloc CoW fork extents unwritten

Christoph Hellwig pointed out that there's a potentially nasty race when
performing simultaneous nearby directio cow writes:

"Thread 1 writes a range from B to C

" B --------- C
p

"a little later thread 2 writes from A to B

" A --------- B
p

[editor's note: the 'p's denote cowextsize boundaries, which I added to
make this more clear]

"but the code preallocates beyond B into the range where thread
"1 has just written, but ->end_io hasn't been called yet.
"But once ->end_io is called thread 2 has already allocated
"up to the extent size hint into the write range of thread 1,
"so the end_io handler will splice the uninitialized blocks from
"that preallocation back into the file right after B."

We can avoid this race by ensuring that thread 1 cannot accidentally
remap the blocks that thread 2 allocated (as part of speculative
preallocation) as part of t2's write preparation in t1's end_io handler.
The way we make this happen is by taking advantage of the unwritten
extent flag as an intermediate step.

Recall that when we begin the process of writing data to shared blocks,
we create a delayed allocation extent in the CoW fork:

D: --RRRRRRSSSRRRRRRRR---
C: ------DDDDDDD---------

When a thread prepares to CoW some dirty data out to disk, it will now
convert the delalloc reservation into an /unwritten/ allocated extent in
the cow fork. The da conversion code tries to opportunistically
allocate as much of a (speculatively prealloc'd) extent as possible, so
we may end up allocating a larger extent than we're actually writing
out:

D: --RRRRRRSSSRRRRRRRR---
C: ------UUUUUUU---------

Next, we convert only the part of the extent that we're actively
planning to write to normal (i.e. not unwritten) status:

D: --RRRRRRSSSRRRRRRRR---
C: ------UURRUUU---------

If the write succeeds, the end_cow function will now scan the relevant
range of the CoW fork for real extents and remap only the real extents
into the data fork:

D: --RRRRRRRRSRRRRRRRR---
C: ------UU--UUU---------

This ensures that we never obliterate valid data fork extents with
unwritten blocks from the CoW fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
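
The state transitions drawn in this commit message can be replayed with
a small model in which each fork is a string and each character is one
block ('R' real, 'S' shared, 'D' delalloc, 'U' unwritten, '-' hole).
This is only a sketch of the idea: the toy_* functions are hypothetical
and operate per block, whereas the kernel works on extents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Turn a delalloc reservation ('D') into unwritten allocated blocks ('U'). */
static void toy_convert_delalloc(char *cow)
{
    for (size_t i = 0; cow[i] != '\0'; i++)
        if (cow[i] == 'D')
            cow[i] = 'U';
}

/* Mark only the blocks that are about to be written as real ('R'). */
static void toy_convert_cow(char *cow, size_t start, size_t len)
{
    assert(start + len <= strlen(cow));
    for (size_t i = start; i < start + len; i++)
        if (cow[i] == 'U')
            cow[i] = 'R';
}

/*
 * On write completion, scan the range and remap only real CoW blocks
 * into the data fork. Unwritten CoW blocks are left where they are.
 */
static void toy_end_cow(char *data, char *cow, size_t start, size_t len)
{
    assert(strlen(data) == strlen(cow));
    assert(start + len <= strlen(cow));
    for (size_t i = start; i < start + len; i++) {
        if (cow[i] != 'R')
            continue;           /* never splice unwritten blocks in */
        data[i] = 'R';
        cow[i] = '-';
    }
}
```

Starting from the first pair of strings in the diagrams, calling
toy_convert_delalloc(), then toy_convert_cow() on blocks 8 and 9, then
toy_end_cow() over the whole former reservation (blocks 6 through 12)
walks through the same four states shown above, and the shared block at
position 10 stays untouched.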


# 22725ce4 19-Dec-2016 Darrick J. Wong <darrick.wong@oracle.com>

vfs: fix isize/pos/len checks for reflink & dedupe

Strengthen the checking of pos/len vs. i_size, clarify the return values
for the clone prep function, and remove pointless code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 876bec6f 09-Dec-2016 Darrick J. Wong <darrick.wong@oracle.com>

vfs: refactor clone/dedupe_file_range common functions

Hoist both the XFS reflink inode state and preparation code and the XFS
file blocks compare functions into the VFS so that ocfs2 can take
advantage of it for reflink and dedupe.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>


# 65523218 29-Nov-2016 Christoph Hellwig <hch@lst.de>

xfs: remove i_iolock and use i_rwsem in the VFS inode instead

This patch drops the XFS-own i_iolock and uses the VFS i_rwsem which
recently replaced i_mutex instead. This means we only have to take
one lock instead of two in many fast path operations, and we can
also shrink the xfs_inode structure. Thanks to the xfs_ilock family
there is very little churn, the only thing of note is that we need
to switch to use the lock_two_directory helper for taking the i_rwsem
on two inodes in a few places to make sure our lock order matches
the one used in the VFS.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 0260d8ff 27-Nov-2016 Brian Foster <bfoster@redhat.com>

xfs: clean up cow fork reservation and tag inodes correctly

COW fork reservation is implemented via delayed allocation. The code is
modeled after the traditional delalloc allocation code, but is slightly
different in terms of how preallocation occurs. Rather than post-eof
speculative preallocation, COW fork preallocation is implemented via a
COW extent size hint that is designed to minimize fragmentation as a
reflinked file is split over time.

xfs_reflink_reserve_cow() still uses logic that is oriented towards
dealing with post-eof speculative preallocation, however, and is stale
or not necessarily correct. First, the EOF alignment to the COW extent
size hint is implemented in xfs_bmapi_reserve_delalloc() (which does so
correctly by aligning the start and end offsets) and so is not necessary
in xfs_reflink_reserve_cow(). The backoff and retry logic on ENOSPC is
also ineffective for the same reason, as xfs_bmapi_reserve_delalloc()
will simply perform the same allocation request on the retry. Finally,
since the COW extent size hint aligns the start and end offset of the
range to allocate, the end_fsb != orig_end_fsb logic is not sufficient.
Indeed, if a write request happens to end on an aligned offset, it is
possible that we do not tag the inode for COW preallocation even though
xfs_bmapi_reserve_delalloc() may have preallocated at the start offset.

Kill the unnecessary, duplicate code in xfs_reflink_reserve_cow().
Remove the inode tag logic as well since xfs_bmapi_reserve_delalloc()
has been updated to tag the inode correctly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
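
The reason the end_fsb != orig_end_fsb test falls short can be shown
with a little arithmetic. The sketch below assumes a simple model in
which the hint rounds the start down and the end up to a multiple of the
hint; the toy_* names are hypothetical and the code is not taken from
xfs_bmapi_reserve_delalloc().

```c
#include <assert.h>
#include <stdbool.h>

/* Half-open block range [start, end). */
struct toy_range {
    unsigned long start;
    unsigned long end;
};

/* Round start down and end up to a multiple of the extent size hint. */
static struct toy_range toy_align_to_hint(unsigned long start,
                                          unsigned long end,
                                          unsigned long hint)
{
    struct toy_range r;

    assert(hint > 0 && start < end);
    r.start = start - start % hint;
    r.end = (end + hint - 1) / hint * hint;
    return r;
}

/* Stale-style check: only notices growth past the requested end. */
static bool toy_end_only_check(unsigned long start, unsigned long end,
                               unsigned long hint)
{
    return toy_align_to_hint(start, end, hint).end != end;
}

/* Check that also notices preallocation in front of the start offset. */
static bool toy_prealloc_happened(unsigned long start, unsigned long end,
                                  unsigned long hint)
{
    struct toy_range r = toy_align_to_hint(start, end, hint);

    return r.start != start || r.end != end;
}
```

For a write covering blocks 5 through 15 with a hint of 8, the aligned
range is blocks 0 through 15: the end is unchanged, so the end-only
check reports nothing, even though blocks 0 through 4 were preallocated.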


# 974ae922 27-Nov-2016 Brian Foster <bfoster@redhat.com>

xfs: track preallocation separately in xfs_bmapi_reserve_delalloc()

Speculative preallocation is currently processed entirely by the callers
of xfs_bmapi_reserve_delalloc(). The caller determines how much
preallocation to include, adjusts the extent length and passes down the
resulting request.

While this works fine for post-eof speculative preallocation, it is not
as reliable for COW fork preallocation. COW fork preallocation is
implemented via the cowextszhint, which aligns the start offset as well
as the length of the extent. Further, it is difficult for the caller to
accurately identify when preallocation occurs because the returned
extent could have been merged with neighboring extents in the fork.

To simplify this situation and facilitate further COW fork preallocation
enhancements, update xfs_bmapi_reserve_delalloc() to take a separate
preallocation parameter to incorporate into the allocation request. The
preallocation blocks value is tacked onto the end of the request and
adjusted to accommodate neighboring extents and extent size limits.
Since xfs_bmapi_reserve_delalloc() now knows precisely how much
preallocation was included in the allocation, it can also tag the inodes
appropriately to support preallocation reclaim.

Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to
use the preallocation mechanism. This patch should not change behavior
outside of correctly tagging reflink inodes when start offset
preallocation occurs (which the caller does not handle correctly).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# fba3e594 27-Nov-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: always succeed when deduping zero bytes

It turns out that btrfs and xfs had differing interpretations of what
to do when the dedupe length is zero. Change xfs to follow btrfs'
semantics so that the userland interface is consistent.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 4ab8671c 23-Nov-2016 Christoph Hellwig <hch@lst.de>

xfs: use new extent lookup helpers in xfs_reflink_end_cow

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# df5ab1b5 23-Nov-2016 Christoph Hellwig <hch@lst.de>

xfs: use new extent lookup helpers in xfs_reflink_cancel_cow_blocks

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 86f12ab0 23-Nov-2016 Christoph Hellwig <hch@lst.de>

xfs: use new extent lookup helpers in xfs_reflink_trim_irec_to_next_cow

And remove the unused return value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 092d5d9d 23-Nov-2016 Christoph Hellwig <hch@lst.de>

xfs: cleanup xfs_reflink_find_cow_mapping

Use xfs_iext_lookup_extent to look up the extent, drop a useless check,
drop an unneeded return value and clean up the general style a little bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 2755fc44 23-Nov-2016 Christoph Hellwig <hch@lst.de>

xfs: use new extent lookup helpers in __xfs_reflink_reserve_cow

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 65c5f419 23-Nov-2016 Christoph Hellwig <hch@lst.de>

xfs: remove prev argument to xfs_bmapi_reserve_delalloc

We can easily lookup the previous extent for the cases where we need it,
which saves the callers from looking it up for us later in the series.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 5d829300 07-Nov-2016 Eric Sandeen <sandeen@sandeen.net>

xfs: provide helper for counting extents from if_bytes

The open-coded pattern:

ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)

is all over the xfs code; provide a new helper
xfs_iext_count(ifp) to count the number of inline extents
in an inode fork.
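
A minimal user-space sketch of the idea, not the kernel code: the
struct and function names below (demo_bmbt_rec, demo_ifork,
demo_iext_count) are hypothetical stand-ins, and the record layout is
assumed to be two 64-bit words purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packed extent record (assumed 16 bytes). */
struct demo_bmbt_rec {
	uint64_t l0;
	uint64_t l1;
};

/* Hypothetical stand-in for an inode fork; only the byte count matters. */
struct demo_ifork {
	size_t if_bytes;	/* bytes of extent records held in the fork */
};

/* One named helper replaces the open-coded division at every call site. */
static inline size_t demo_iext_count(const struct demo_ifork *ifp)
{
	assert(ifp != NULL);
	return ifp->if_bytes / sizeof(struct demo_bmbt_rec);
}
```

Call sites then read demo_iext_count(ifp) instead of repeating the
division and the cast.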

[dchinner: pick up several missed conversions]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 39937234 07-Nov-2016 Brian Foster <bfoster@redhat.com>

xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan

The cowblocks background scanner currently clears the cowblocks tag
for inodes without any real allocations in the cow fork. This
excludes inodes with only delalloc blocks in the cow fork. While we
might never expect to clear delalloc blocks from the cow fork in the
background scanner, it is not necessarily correct to clear the
cowblocks tag from such inodes.

For example, if the background scanner happens to process an inode
between a buffered write and writeback, the scanner catches the
inode in a state after delalloc blocks have been allocated to the
cow fork but before the delalloc blocks have been converted to real
blocks by writeback. The background scanner then incorrectly clears
the cowblocks tag, even if part of the aforementioned delalloc
reservation will not be remapped to the data fork (i.e., extra
blocks due to the cowextsize hint). This means that any such
additional blocks in the cow fork might never be reclaimed by the
background scanner and could persist until the inode itself is
reclaimed.

To address this problem, only skip and clear inodes without any cow
fork allocations whatsoever from the background scanner. While we
generally do not want to cancel delalloc reservations from the
background scanner, the pagecache dirty check following the
cowblocks check should prevent that situation. If we do end up with
delalloc cow fork blocks without a dirty address space mapping, this
is probably an indication that something has gone wrong and the
blocks should be reclaimed, as they may never be converted to a real
allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# c17a8ef4 23-Oct-2016 Brian Foster <bfoster@redhat.com>

xfs: clear cowblocks tag when cow fork is emptied

The background cowblocks scan job takes care of scanning for inodes with
potentially lingering blocks in the cow fork and clearing them out. If
the background scanner reclaims the cow fork blocks, however, it doesn't
immediately clear the cowblocks tag from the inode. Instead, the inode
remains tagged until the background scanner comes around again,
discovers the inode cow fork has no blocks, clears the tag and fires the
trace_xfs_inode_free_cowblocks_invalid() tracepoint to indicate that the
inode may have been incorrectly tagged.

This is not a major functional problem as the tag is ultimately cleared.
Nonetheless, clear the tag when an inode cow fork is explicitly emptied
to avoid the extra round trip through the background scanner and
spurious "invalid" tracepoint.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# c1112b6e 19-Oct-2016 Christoph Hellwig <hch@lst.de>

xfs: optimize xfs_reflink_end_cow

Instead of doing a full extent list search for each extent that is
to be deleted using xfs_bmapi_read and then doing another one inside
of xfs_bunmapi_cow, use the same scheme that xfs_bunmapi uses: look
up the last extent to be deleted and then use the extent index to
walk downward until we are outside the range to be deleted.
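
The scheme can be sketched in user space as follows. This is only an
illustration under assumptions: the extent list is modeled as a sorted,
non-overlapping array, and demo_extent, demo_lookup_last and
demo_walk_down are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_extent {
	uint64_t off;	/* first file block covered */
	uint64_t len;	/* number of blocks covered */
};

/* One binary search: index of the last extent starting before 'end', or -1. */
static long demo_lookup_last(const struct demo_extent *ext, size_t n,
			     uint64_t end)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ext[mid].off < end)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (long)lo - 1;
}

/*
 * Walk downward by index from the last extent in [start, end) and return
 * how many blocks fall inside the range. No further lookups are done.
 */
static uint64_t demo_walk_down(const struct demo_extent *ext, size_t n,
			       uint64_t start, uint64_t end)
{
	uint64_t blocks = 0;

	assert(start <= end);
	for (long i = demo_lookup_last(ext, n, end); i >= 0; i--) {
		uint64_t e_end = ext[i].off + ext[i].len;

		if (e_end <= start)
			break;	/* outside the range: stop walking */

		uint64_t lo = ext[i].off > start ? ext[i].off : start;
		uint64_t hi = e_end < end ? e_end : end;

		blocks += hi - lo;
	}
	return blocks;
}
```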

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 3e0ee78f 19-Oct-2016 Christoph Hellwig <hch@lst.de>

xfs: optimize xfs_reflink_cancel_cow_blocks

Rewrite xfs_reflink_cancel_cow_blocks so that we only do a search for
the first extent in the extent list and then iterate over the remaining
extents using the extent index, passing the extent we operate on
directly to xfs_bmap_del_extent_delay or xfs_bmap_del_extent_cow instead
of going through xfs_bunmapi and doing yet another extent list lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# fa5c836c 19-Oct-2016 Christoph Hellwig <hch@lst.de>

xfs: refactor xfs_bunmapi_cow

Split out two helpers for deleting delayed or real extents from the COW fork.
This allows them to be called directly from xfs_reflink_cow_end_io once that
function is refactored to iterate the extent tree. It will also allow
reusing the delalloc deletion from xfs_bunmapi in the future.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 3ba020be 19-Oct-2016 Christoph Hellwig <hch@lst.de>

xfs: optimize writes to reflink files

Instead of reserving space as the first thing in write_begin, move it past
reading the extent in the data fork. That way we only have to read from
the data fork once and can reuse that information for trimming the extent
to the shared/unshared boundary. Additionally, this makes it easy to
limit the actual write size to said boundary and avoids a roundtrip on the
ilock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 62c5ac89 19-Oct-2016 Christoph Hellwig <hch@lst.de>

xfs: handle "raw" delayed extents xfs_reflink_trim_around_shared

Delalloc extents in the extent list contain the number of reserved
indirect blocks in their startblock value and don't use the magic
DELAYSTARTBLOCK constant. Ensure that xfs_reflink_trim_around_shared
handles them properly by checking for isnullstartblock().
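
To illustrate why a comparison against a single magic constant misses
these extents, here is a small sketch of a mask-based encoding. The
demo_ names and the bit widths (17 value bits under a 35-bit mask) are
assumptions chosen for the example, not a statement of the on-disk
format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_VALBITS	17	/* assumed width of the indirect-block count */
#define DEMO_MASKBITS	35	/* assumed width of the "delalloc" marker */
#define DEMO_MASK \
	(((UINT64_C(1) << DEMO_MASKBITS) - 1) << DEMO_VALBITS)

/* Encode a delalloc startblock carrying 'indlen' reserved indirect blocks. */
static inline uint64_t demo_nullstartblock(uint32_t indlen)
{
	assert(indlen < (UINT32_C(1) << DEMO_VALBITS));
	return DEMO_MASK | indlen;
}

/* True for any delalloc encoding, whatever count sits in the low bits. */
static inline bool demo_isnullstartblock(uint64_t sb)
{
	return (sb & DEMO_MASK) == DEMO_MASK;
}

/* Recover the reserved indirect block count from the low bits. */
static inline uint64_t demo_startblockval(uint64_t sb)
{
	return sb & ~DEMO_MASK;
}
```

With this encoding, demo_nullstartblock(5) differs from the bare mask
value, so only the mask test recognizes it as a delayed extent.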

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 5faaf4fa 19-Oct-2016 Christoph Hellwig <hch@lst.de>

xfs: merge xfs_reflink_remap_range and xfs_file_share_range

There is no clear division of responsibility between those functions, so
just merge them into one to keep the code simple. Also move
xfs_file_wait_for_io to xfs_reflink.c together with its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 57617781 19-Oct-2016 Christoph Hellwig <hch@lst.de>

xfs: move inode locking from xfs_reflink_remap_range to xfs_file_share_range

We need the iolock protection to stabilize the IS_SWAPFILE and
IS_IMMUTABLE values, as well as to prevent new buffered writers from
re-dirtying the file data that we just wrote out.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 1be7f9be 19-Oct-2016 Geert Uytterhoeven <geert@linux-m68k.org>

xfs: Fix uninitialized variable in xfs_reflink_reserve_cow_range()

with gcc 4.1.2:

fs/xfs/xfs_reflink.c: In function xfs_reflink_reserve_cow_range:
fs/xfs/xfs_reflink.c:327: warning: error may be used uninitialized in this function

Indeed, if "count" is zero, the function will return an uninitialized
error value.

While "count" is unlikely to be zero, this function is called through
the public iomap API. Hence fix this by preinitializing error to zero.
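
The pattern behind the warning can be sketched as below. The function
names are hypothetical and the per-block step is a placeholder; the
point is only that the return value is defined even when the loop body
never runs.

```c
#include <stddef.h>

/* Hypothetical per-block step; fails on a negative block number. */
static int demo_reserve_one(long blk)
{
	return blk < 0 ? -1 : 0;
}

static int demo_reserve_range(long start, size_t count)
{
	int error = 0;	/* preinitialized: stays 0 if count is zero */

	for (size_t i = 0; i < count; i++) {
		error = demo_reserve_one(start + (long)i);
		if (error)
			break;
	}
	return error;
}
```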

Fixes: 2a06705cd5954030 ("xfs: create delalloc extents in CoW fork")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 9780643c 09-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: fix error initialization

Eric Sandeen reported a gcc complaint about uninitialized error
variables, so fix that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 97a1b87e 09-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: remove isize check from unshare operation

Now that fallocate has an explicit unshare flag again, let's try
to remove the inode reflink flag whenever the user unshares any
part of a file since checking is cheap compared to the CoW.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 024adf48 09-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: reduce stack usage of _reflink_clear_inode_flag

The loop in _reflink_clear_inode_flag isn't necessary since we
jump out if any part of any extent is shared. Remove the loop
and we no longer need two maps, so we can save some stack use.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 63646fc5 09-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: check inode reflink flag before calling reflink functions

There are a couple of places where we don't check the inode's
reflink flag before calling into the reflink code. Fix those,
and add some asserts so we don't make this mistake again.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>


# 83104d44 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: garbage collect old cowextsz reservations

Trim CoW reservations made on behalf of a cowextsz hint if they get too
old or we run low on quota, so long as we don't have dirty data awaiting
writeback or directio operations in progress.

Garbage collection of the cowextsize extents is kept separate from
prealloc extent reaping because setting the CoW prealloc lifetime to a
(much) higher value than the regular prealloc extent lifetime has been
useful for combatting CoW fragmentation on VM hosts where the VMs
experience bursty write behaviors and we can keep the utilization ratios
low enough that we don't start to run out of space. IOWs, it benefits
us to keep the CoW fork reservations around for as long as we can unless
we run out of blocks or hit inode reclaim.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 6fa164b8 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: don't allow reflink when the AG is low on space

If the AG free space is down to the reserves, refuse to reflink our
way out of space. Hopefully userspace will make a real copy and/or go
elsewhere.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# f7ca3522 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: create a separate cow extent size hint for the allocator

Create a per-inode extent size allocator hint for copy-on-write. This
hint is separate from the existing extent size hint so that CoW can
take advantage of the fragmentation-reducing properties of extent size
hints without disabling delalloc for regular writes.

The extent size hint that's fed to the allocator during a copy on
write operation is the greater of the cowextsize and regular extsize
hint.

During reflink, if we're sharing the entire source file to the entire
destination file and the destination file doesn't already have a
cowextsize hint, propagate the source file's cowextsize hint to the
destination file.
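
The two rules above can be sketched as follows. This is a toy model:
demo_inode_hints and both functions are hypothetical names, hints are
plain block counts, and a value of zero is assumed to mean "no hint
set".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_inode_hints {
	uint32_t extsize;	/* regular extent size hint (0 = unset) */
	uint32_t cowextsize;	/* CoW extent size hint (0 = unset) */
};

/* Hint fed to the allocator for a CoW operation: the greater of the two. */
static uint32_t demo_cow_alloc_hint(const struct demo_inode_hints *ip)
{
	assert(ip != NULL);
	return ip->cowextsize > ip->extsize ? ip->cowextsize : ip->extsize;
}

/* Whole-file reflink: copy the source hint only if the target has none. */
static void demo_propagate_cowextsize(const struct demo_inode_hints *src,
				      struct demo_inode_hints *dst,
				      bool whole_file)
{
	assert(src != NULL && dst != NULL);
	if (whole_file && dst->cowextsize == 0)
		dst->cowextsize = src->cowextsize;
}
```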

Furthermore, zero the bulkstat buffer prior to setting the fields
so that we don't copy kernel memory contents into userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 98cc2db5 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: unshare a range of blocks via fallocate

Unshare all shared extents if the user calls fallocate with the new
unshare mode flag set, so that we can guarantee that a subsequent
write will not ENOSPC.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: pass inode instead of file to xfs_reflink_dirty_range,
use iomap infrastructure for copy up]
Signed-off-by: Christoph Hellwig <hch@lst.de>


# cc714660 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: add dedupe range vfs function

Define a VFS function which allows userspace to request that the
kernel reflink a range of blocks between two files if the ranges'
contents match. The function fits the new VFS ioctl that standardizes
the checking for the btrfs EXTENT SAME ioctl.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 862bb360 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: reflink extents from one file to another

Reflink extents from one file to another; that is to say, iteratively
remove the mappings from the destination file, copy the mappings from
the source file to the destination file, and increment the reference
count of all the blocks that got remapped.
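
As a rough illustration of the three steps, the toy model below maps
each file offset to a block number and keeps a flat reference count
array. All names (demo_fs, demo_remap, DEMO_HOLE) are hypothetical, and
the real code works on extents inside transactions rather than on
single blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NBLOCKS	16
#define DEMO_HOLE	UINT32_MAX	/* offset has no block mapped */

struct demo_fs {
	uint32_t refcount[DEMO_NBLOCKS];	/* owners per physical block */
};

static void demo_remap(struct demo_fs *fs, const uint32_t *src,
		       uint32_t *dst, size_t off, size_t len)
{
	assert(fs != NULL && src != NULL && dst != NULL);
	for (size_t i = off; i < off + len; i++) {
		/* 1. Remove the destination mapping, dropping its reference. */
		if (dst[i] != DEMO_HOLE)
			fs->refcount[dst[i]]--;
		/* 2. Copy the source mapping to the destination. */
		dst[i] = src[i];
		/* 3. Increment the reference count of the remapped block. */
		if (dst[i] != DEMO_HOLE)
			fs->refcount[dst[i]]++;
	}
}
```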

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 174edb0e 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: store in-progress CoW allocations in the refcount btree

Due to the way the CoW algorithm in XFS works, there's an interval
during which blocks allocated to handle a CoW can be lost -- if the FS
goes down after the blocks are allocated but before the block
remapping takes place. This is exacerbated by the cowextsz hint --
allocated reservations can sit around for a while, waiting to get
used.

Since the refcount btree doesn't normally store records with refcount
of 1, we can use it to record these in-progress extents. In-progress
blocks cannot be shared because they're not user-visible, so there
shouldn't be any conflicts with other programs. This is a better
solution than holding EFIs during writeback because (a) EFIs can't be
relogged currently, (b) even if they could, EFIs are bound by
available log space, which puts an unnecessary upper bound on how much
CoW we can have in flight, and (c) we already have a mechanism to
track blocks.

At mount time, read the refcount records and free anything we find
with a refcount of 1 because those were in-progress when the FS went
down.
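
The mount-time pass can be pictured with the sketch below. It is a
simplified model that assumes the records sit in a flat array;
demo_refc_rec and demo_recover_cow are hypothetical names and the real
records live in a btree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_refc_rec {
	uint32_t start;		/* first block of the extent */
	uint32_t len;		/* length in blocks */
	uint32_t refcount;	/* 1 marks an in-progress CoW allocation */
};

/*
 * Drop every refcount == 1 record, compacting the array in place.
 * Returns the number of blocks that would be freed; *n is updated to
 * the number of records kept.
 */
static uint32_t demo_recover_cow(struct demo_refc_rec *recs, size_t *n)
{
	uint32_t freed = 0;
	size_t keep = 0;

	assert(recs != NULL && n != NULL);
	for (size_t i = 0; i < *n; i++) {
		if (recs[i].refcount == 1) {
			freed += recs[i].len;	/* leftover from a crash */
			continue;
		}
		recs[keep++] = recs[i];		/* genuinely shared extent */
	}
	*n = keep;
	return freed;
}
```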

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 0613f16c 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: implement CoW for directio writes

For O_DIRECT writes to shared blocks, we have to CoW them just like
we would with buffered writes. For writes that are not block-aligned,
just bounce them to the page cache.

For block-aligned writes, however, we can do better than that. Use
the same mechanisms that we employ for buffered CoW to set up a
delalloc reservation, allocate all the blocks at once, issue the
writes against the new blocks and use the same ioend functions to
remap the blocks after the write. This should be fairly performant.

Christoph discovered that xfs_reflink_allocate_cow_range may stumble
over invalid entries in the extent array given that it drops the ilock
but still expects the index to be stable. Simply fixing it to do a new
lookup for every iteration still isn't correct given that
xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
there is nothing preventing an xfs_bunmapi_cow call from removing extents
once we dropped the ilock either.

This patch duplicates the inner loop of xfs_bmapi_allocate into a
helper for xfs_reflink_allocate_cow_range so that it can be done under
the same ilock critical section as our CoW fork delayed allocation.
The directio CoW warts will be revisited in a later patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 43caeb18 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: move mappings from cow fork to data fork after copy-write

After the write component of a copy-write operation finishes, clean up
the bookkeeping left behind. On error, we simply free the new blocks
and pass the error up. If we succeed, however, then we must remove
the old data fork mapping and move the cow fork mapping to the data
fork.
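
The two outcomes can be sketched per block as below. This is a toy
model with hypothetical names (demo_inode, demo_end_cow, DEMO_HOLE);
the function returns the block number that the caller would release,
which is the new block on error and the old data fork block on success.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HOLE	UINT32_MAX	/* no block mapped at this offset */

struct demo_inode {
	uint32_t data[4];	/* data fork: offset -> block */
	uint32_t cow[4];	/* CoW fork: offset -> new block */
};

static uint32_t demo_end_cow(struct demo_inode *ip, size_t i, int io_error)
{
	uint32_t released;

	assert(ip != NULL && i < 4 && ip->cow[i] != DEMO_HOLE);
	if (io_error) {
		/* Write failed: give back the new block, keep the old map. */
		released = ip->cow[i];
	} else {
		/* Write succeeded: drop the old map, move the new one over. */
		released = ip->data[i];
		ip->data[i] = ip->cow[i];
	}
	ip->cow[i] = DEMO_HOLE;
	return released;
}
```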

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: Call the CoW failure function during xfs_cancel_ioend]
Signed-off-by: Christoph Hellwig <hch@lst.de>


# ef473667 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: allocate delayed extents in CoW fork

Modify the writepage handler to find and convert pending delalloc
extents to real allocations. Furthermore, when we're doing non-cow
writes to a part of a file that already has a CoW reservation (the
cowextsz hint that we set up in a subsequent patch facilitates this),
promote the write to copy-on-write so that the entire extent can get
written out as a single extent on disk, thereby reducing post-CoW
fragmentation.

Christoph moved the CoW support code in _map_blocks to a separate helper
function, refactored other functions, and reduced the number of CoW fork
lookups, so I merged those changes here to reduce churn.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 2a06705c 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: create delalloc extents in CoW fork

Wire up iomap_begin to detect shared extents and create delayed allocation
extents in the CoW fork:

1) Check if we already have an extent in the COW fork for the area.
If so nothing to do, we can move along.
2) Look up block number for the current extent, and if there is none,
it's not shared, so move along.
3) Unshare the current extent as far as we are going to write into it.
For this we avoid an additional COW fork lookup and use the
information we set aside in step 1) above.
4) Goto 1) unless we've covered the whole range.
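
The four steps above can be sketched as a loop over a toy per-block
model. Everything here is an assumption made for illustration: the
demo_blk fields and demo_reserve_cow are hypothetical, and the real
code works on extents rather than single blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_blk {
	bool in_cow_fork;	/* a CoW fork extent already covers this block */
	bool mapped;		/* the data fork has a block number here */
	bool shared;		/* that block is referenced by another file */
};

/* Returns how many new CoW reservations were created in [off, off + len). */
static size_t demo_reserve_cow(struct demo_blk *file, size_t off, size_t len)
{
	size_t reserved = 0;

	assert(file != NULL);
	for (size_t i = off; i < off + len; i++) {	/* step 4: loop */
		/* Step 1: already in the CoW fork, nothing to do. */
		if (file[i].in_cow_fork)
			continue;
		/* Step 2: a hole or an unshared block needs no CoW. */
		if (!file[i].mapped || !file[i].shared)
			continue;
		/* Step 3: create the reservation for the part we write. */
		file[i].in_cow_fork = true;
		reserved++;
	}
	return reserved;
}
```

Calling it twice over the same range creates no further reservations,
because step 1 skips everything the first call set up.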

Last but not least, this updates the xfs_reflink_reserve_cow_range calling
convention to pass a byte offset and length, as that is what both callers
expect anyway. This patch has been refactored considerably as part of the
iomap transition.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>


# 3993baeb 03-Oct-2016 Darrick J. Wong <darrick.wong@oracle.com>

xfs: introduce the CoW fork

Introduce a new in-core fork for storing copy-on-write delalloc
reservations and allocated extents that are in the process of being
written out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>