History log of /linux-master/fs/netfs/write_collect.c
Revision Date Author Comments
# 9b038d00 21-May-2024 David Howells <dhowells@redhat.com>

netfs: Fix io_uring based write-through

The bug can be triggered by mounting a cifs filesystem with the cache=strict
mount option and then, using the fsx program from xfstests, running:

    ltp/fsx -A -d -N 1000 -S 11463 -P /tmp /cifs-mount/foo \
        --replay-ops=gen112-fsxops

Where gen112-fsxops holds:

    fallocate 0x6be7 0x8fc5 0x377d3
    copy_range 0x9c71 0x77e8 0x2edaf 0x377d3
    write 0x2776d 0x8f65 0x377d3

The problem is that netfs_io_request::len is being used for two purposes
and ends up getting set to the amount of data we transferred, not the
amount of data the caller asked to be transferred (for various reasons,
such as mmap'd writes, we might end up rounding out the data written to the
server to include the entire folio at each end).

Fix this by keeping the amount we were asked to write in ->len and using
->submitted to track what we issued ops for. Then, when we come to calling
->ki_complete(), ->len is the right size.

This also required netfs_cleanup_dio_write() to change since we're no
longer advancing wreq->len. Use wreq->transferred instead as we might have
done a short write.
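
The split between the three counters can be illustrated with a minimal sketch. It uses a hypothetical struct demo_wreq rather than the real struct netfs_io_request, assumes fixed-size folios, and the helper names (demo_issue, demo_completion_size, demo_dio_end) are invented for illustration only:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the counters discussed above; this is NOT the
 * real struct netfs_io_request. */
struct demo_wreq {
	unsigned long long start;	/* file position the caller asked for */
	size_t len;			/* bytes the caller asked to write */
	size_t submitted;		/* bytes we issued operations for */
	size_t transferred;		/* bytes confirmed as written */
};

/* Round the requested span out to whole folios (assumed to be of a fixed
 * size here) and record how much gets issued, leaving ->len untouched. */
static void demo_issue(struct demo_wreq *wreq, size_t folio_size)
{
	unsigned long long first, last;

	assert(folio_size != 0);
	first = wreq->start - wreq->start % folio_size;
	last = wreq->start + wreq->len;
	if (last % folio_size)
		last += folio_size - last % folio_size;
	wreq->submitted = (size_t)(last - first);
}

/* The size reported to the caller's completion handler: what was asked for,
 * not what was rounded out and issued. */
static size_t demo_completion_size(const struct demo_wreq *wreq)
{
	return wreq->len;
}

/* End position used for post-write cleanup (assumption: the request start
 * plus what was actually transferred, which may be short). */
static unsigned long long demo_dio_end(const struct demo_wreq *wreq)
{
	return wreq->start + wreq->transferred;
}
```

With the write from the fsx script above (position 0x2776d, length 0x8f65) and 4096-byte folios, this model issues 0xa000 bytes while still reporting 0x8f65 to the caller.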

With this, the generic/112 xfstest passes if cifs is forced to put all
non-DIO opens into write-through mode.

Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/295086.1716298663@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <stfrench@microsoft.com>
cc: Enzo Matsumiya <ematsumiya@suse.de>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# 1ecb146f 15-Mar-2024 David Howells <dhowells@redhat.com>

netfs, afs: Use writeback retry to deal with alternate keys

Use a hook in the new writeback code's retry algorithm to rotate the keys
once all the outstanding subreqs have failed rather than doing it
separately on each subreq.
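
A rough sketch of the idea follows, assuming a simplified model in which a retry pass is only considered once no subrequest is still outstanding. All types and names (demo_subreq, demo_stream, demo_rotate_key, demo_prepare_retry) are hypothetical and do not reflect the actual netfs or afs interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-subrequest state. */
struct demo_subreq {
	bool done;	/* completed, successfully or not */
	int error;	/* 0 on success, negative on failure */
};

/* Hypothetical stream state, including the key currently in use. */
struct demo_stream {
	struct demo_subreq *subreqs;
	size_t nr;
	int key_index;
	int nr_keys;
};

/* Hypothetical hook: move on to the next key, if there is one. */
static bool demo_rotate_key(struct demo_stream *s)
{
	if (s->key_index + 1 >= s->nr_keys)
		return false;
	s->key_index++;
	return true;
}

/* Decide whether to start a retry pass.  The key is rotated once per pass,
 * and only after every subrequest has completed, rather than once for each
 * failed subrequest. */
static bool demo_prepare_retry(struct demo_stream *s)
{
	bool any_failed = false;
	size_t i;

	for (i = 0; i < s->nr; i++) {
		if (!s->subreqs[i].done)
			return false;	/* still outstanding, keep waiting */
		if (s->subreqs[i].error)
			any_failed = true;
	}
	return any_failed && demo_rotate_key(s);
}
```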

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org


# 2df86547 07-Mar-2024 David Howells <dhowells@redhat.com>

netfs: Cut over to using new writeback code

Cut over to using the new writeback code. The old code is #ifdef'd out or
otherwise removed from compilation to avoid conflicts and will be removed
in a future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org


# 288ace2f 18-Mar-2024 David Howells <dhowells@redhat.com>

netfs: New writeback implementation

The current netfslib writeback implementation creates writeback requests of
contiguous folio data and then separately tiles subrequests over the space
twice, once for the server and once for the cache. This creates a few
issues:

(1) Every time there's a discontiguity or a change between writing to only
one destination or writing to both, it must create a new request.
This makes it harder to do vectored writes.

(2) The folios don't have the writeback mark removed until the end of the
request - and a request could be hundreds of megabytes.

(3) In future, I want to support a larger cache granularity, which will
require aggregation of some folios that contain unmodified data (which
only need to go to the cache) and some which contain modifications
(which need to be uploaded and stored to the cache) - but, currently,
these are treated as discontiguous.

There's also a move to get everyone to use writeback_iter() to extract
writable folios from the pagecache. That said, currently writeback_iter()
has some issues that make it less than ideal:

(1) there's no way to cancel the iteration, even if you find a "temporary"
error that means the current folio and all subsequent folios are going
to fail;

(2) there's no way to filter the folios being written back - something
that will impact Ceph with its ordered snap system;

(3) and if you get a folio you can't immediately deal with (say you need
to flush the preceding writes), you are left with a folio hanging in
the locked state for the duration, when really we should unlock it and
relock it later.

In this new implementation, I use writeback_iter() to pump folios,
progressively creating two parallel, but separate streams and cleaning up
the finished folios as the subrequests complete. Either or both streams
can contain gaps, and the subrequests in each stream can be of variable
size, don't need to align with each other and don't need to align with the
folios.

Indeed, subrequests can cross folio boundaries, may cover several folios or
a folio may be spanned by multiple subrequests, e.g.:

         +---+---+-----+-----+---+----------+
Folios:  |   |   |     |     |   |          |
         +---+---+-----+-----+---+----------+

           +------+------+     +----+----+
Upload:    |      |      |.....|    |    |
           +------+------+     +----+----+

         +------+------+------+------+------+
Cache:   |      |      |      |      |      |
         +------+------+------+------+------+
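
The independence of subrequest and folio boundaries can be sketched as follows. This is only an illustration: the types and helpers (demo_span, demo_tile, demo_folios_touched) are hypothetical, the units are arbitrary, and the per-stream maximum subrequest size is an assumed parameter:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one subrequest: a file position and a length. */
struct demo_span {
	size_t start;
	size_t len;
};

/* Tile [start, start + len) into subrequests of at most max_len each,
 * ignoring folio boundaries entirely.  Returns the number of spans written
 * to out[], which holds at most cap entries. */
static size_t demo_tile(size_t start, size_t len, size_t max_len,
			struct demo_span *out, size_t cap)
{
	size_t n = 0;

	assert(max_len != 0);
	while (len > 0 && n < cap) {
		size_t part = len < max_len ? len : max_len;

		out[n].start = start;
		out[n].len = part;
		n++;
		start += part;
		len -= part;
	}
	return n;
}

/* Count how many folios a span overlaps, given the ascending end offsets of
 * contiguous folios that begin at offset 0. */
static size_t demo_folios_touched(const struct demo_span *span,
				  const size_t *folio_ends, size_t nr_folios)
{
	size_t i, count = 0, folio_start = 0;

	for (i = 0; i < nr_folios; i++) {
		if (span->start < folio_ends[i] &&
		    span->start + span->len > folio_start)
			count++;
		folio_start = folio_ends[i];
	}
	return count;
}
```

Each stream would call demo_tile() with its own maximum size, so the two sets of spans need not line up with each other or with the folios.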

The progressive subrequest construction permits the algorithm to be
preparing both the next upload to the server and the next write to the
cache whilst the previous ones are already in progress. Throttling can be
applied to control the rate of production of subrequests - and, in any
case, we probably want to write them to the server in ascending order,
particularly if the file will be extended.

Content crypto can also be prepared at the same time as the subrequests and
run asynchronously, with the prepped requests being stalled until the
crypto catches up with them. This might also be useful for transport
crypto, but that happens at a lower layer, so probably would be harder to
pull off.

The algorithm is split into three parts:

(1) The issuer. This walks through the data, packaging it up, encrypting
it and creating subrequests. The part of this that generates
subrequests only deals with file positions and spans and so is usable
for DIO/unbuffered writes as well as buffered writes.

(2) The collector. This asynchronously collects completed subrequests,
unlocks folios, frees crypto buffers and performs any retries. This
runs in a work queue so that the issuer can return to the caller for
writeback (so that the VM can have its kswapd thread back) or async
writes.

(3) The retryer. This pauses the issuer, waits for all outstanding
subrequests to complete and then goes through the failed subrequests
to reissue them. This may involve reprepping them (with cifs, the
credits must be renegotiated, and a subrequest may need splitting),
and doing RMW for content crypto if there's a conflicting change on
the server.
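
One way to picture the collector's job is as advancing a "collected to" position across completed subrequests and releasing the folios that fall wholly below it. The sketch below assumes a single stream with no gaps and subrequests listed in ascending order; the names (demo_done_subreq, demo_collected_to, demo_folios_cleanable) are hypothetical and this is not the kernel's implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subrequest record for a single, gap-free stream. */
struct demo_done_subreq {
	size_t start;
	size_t len;
	bool done;	/* completed successfully */
};

/* Walk the subrequests in ascending order and return the position up to
 * which everything has completed.  The walk stops at the first subrequest
 * that is still in progress, even if later ones have already finished. */
static size_t demo_collected_to(const struct demo_done_subreq *s, size_t nr)
{
	size_t i, pos = nr ? s[0].start : 0;

	for (i = 0; i < nr; i++) {
		if (!s[i].done)
			break;
		pos = s[i].start + s[i].len;
	}
	return pos;
}

/* Count the folios lying entirely below pos, given the ascending end
 * offsets of contiguous folios; in this model, those are the ones whose
 * writeback mark could be cleared. */
static size_t demo_folios_cleanable(const size_t *folio_ends, size_t nr,
				    size_t pos)
{
	size_t i, count = 0;

	for (i = 0; i < nr; i++) {
		if (folio_ends[i] > pos)
			break;
		count++;
	}
	return count;
}
```

In this model, folios are released progressively as the position advances rather than all at once when the whole request finishes.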

[!] Note that some of the functions are prefixed with "new_" to avoid
clashes with existing functions. These will be renamed in a later patch
that cuts over to the new algorithm.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org