History log of /linux-master/fs/zonefs/super.c
Revision Date Author Comments
# 60b703c7 01-Apr-2024 Thorsten Blum <thorsten.blum@toblux.com>

zonefs: Use str_plural() to fix Coccinelle warning

Fixes the following Coccinelle/coccicheck warning reported by
string_choices.cocci:

opportunity for str_plural(zgroup->g_nr_zones)

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>


# f88c3fb8 12-Mar-2024 Linus Torvalds <torvalds@linux-foundation.org>

mm, slab: remove last vestiges of SLAB_MEM_SPREAD

Yes, yes, I know the slab people were planning on going slow and letting
every subsystem fight this thing on their own. But let's just rip off
the band-aid and get it over and done with. I don't want to see a
number of unnecessary pull requests just to get rid of a flag that no
longer has any meaning.

This was mainly done with a couple of 'sed' scripts and then some manual
cleanup of the end result.

Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 567e629f 09-Feb-2024 Bill O'Donnell <bodonnel@redhat.com>

zonefs: convert zonefs to use the new mount api

Convert the zonefs filesystem to use the new mount API.
Tested using the zonefs test suite from:
https://github.com/damien-lemoal/zonefs-tools

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>


# 71f4ecdb 29-Jan-2024 Johannes Thumshirn <johannes.thumshirn@wdc.com>

block: remove gfp_flags from blkdev_zone_mgmt

Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use
memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can
drop the gfp_mask parameter from blkdev_zone_mgmt() as well as
blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated().

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-5-ae3b7c8def61@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 9105ce59 29-Jan-2024 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call

Pass GFP_KERNEL instead of GFP_NOFS to the blkdev_zone_mgmt() call in
zonefs_zone_mgmt().

As as zonefs_zone_mgmt() and zonefs_inode_zone_mgmt() are never called
from a place that can recurse back into the filesystem on memory reclaim,
it is save to call blkdev_zone_mgmt() with GFP_KERNEL.

Link: https://lore.kernel.org/all/ZZcgXI46AinlcBDP@casper.infradead.org/
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-1-ae3b7c8def61@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 14db5f64 08-Feb-2024 Damien Le Moal <dlemoal@kernel.org>

zonefs: Improve error handling

Write error handling is racy and can sometime lead to the error recovery
path wrongly changing the inode size of a sequential zone file to an
incorrect value which results in garbage data being readable at the end
of a file. There are 2 problems:

1) zonefs_file_dio_write() updates a zone file write pointer offset
after issuing a direct IO with iomap_dio_rw(). This update is done
only if the IO succeed for synchronous direct writes. However, for
asynchronous direct writes, the update is done without waiting for
the IO completion so that the next asynchronous IO can be
immediately issued. However, if an asynchronous IO completes with a
failure right before the i_truncate_mutex lock protecting the update,
the update may change the value of the inode write pointer offset
that was corrected by the error path (zonefs_io_error() function).

2) zonefs_io_error() is called when a read or write error occurs. This
function executes a report zone operation using the callback function
zonefs_io_error_cb(), which does all the error recovery handling
based on the current zone condition, write pointer position and
according to the mount options being used. However, depending on the
zoned device being used, a report zone callback may be executed in a
context that is different from the context of __zonefs_io_error(). As
a result, zonefs_io_error_cb() may be executed without the inode
truncate mutex lock held, which can lead to invalid error processing.

Fix both problems as follows:
- Problem 1: Perform the inode write pointer offset update before a
direct write is issued with iomap_dio_rw(). This is safe to do as
partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
set) and any failed IO will trigger the execution of zonefs_io_error()
which will correct the inode write pointer offset to reflect the
current state of the one on the device.
- Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
and call this function directly from __zonefs_io_error() after
obtaining the zone information using blkdev_report_zones() with a
simple callback function that copies to a local stack variable the
struct blk_zone obtained from the device. This ensures that error
handling is performed holding the inode truncate mutex.
This change also simplifies error handling for conventional zone files
by bypassing the execution of report zones entirely. This is safe to
do because the condition of conventional zones cannot be read-only or
offline and conventional zone files are always fully mapped with a
constant file size.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>


# 00488aa2 10-Nov-2023 Al Viro <viro@zeniv.linux.org.uk>

zonefs: d_splice_alias() will do the right thing on ERR_PTR() inode

Acked-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# 8df379a3 04-Oct-2023 Jeff Layton <jlayton@kernel.org>

zonefs: convert to new timestamp accessors

Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-76-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>


# f74207d8 05-Jul-2023 Jeff Layton <jlayton@kernel.org>

zonefs: convert to ctime accessor functions

In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Acked-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-81-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# fe9da61f 06-Aug-2023 Damien Le Moal <dlemoal@kernel.org>

zonefs: fix synchronous direct writes to sequential files

Commit 16d7fd3cfa72 ("zonefs: use iomap for synchronous direct writes")
changes zonefs code from a self-built zone append BIO to using iomap for
synchronous direct writes. This change relies on iomap submit BIO
callback to change the write BIO built by iomap to a zone append BIO.
However, this change overlooked the fact that a write BIO may be very
large as it is split when issued. The change from a regular write to a
zone append operation for the built BIO can result in a block layer
warning as zone append BIO are not allowed to be split.

WARNING: CPU: 18 PID: 202210 at block/bio.c:1644 bio_split+0x288/0x350
Call Trace:
? __warn+0xc9/0x2b0
? bio_split+0x288/0x350
? report_bug+0x2e6/0x390
? handle_bug+0x41/0x80
? exc_invalid_op+0x13/0x40
? asm_exc_invalid_op+0x16/0x20
? bio_split+0x288/0x350
bio_split_rw+0x4bc/0x810
? __pfx_bio_split_rw+0x10/0x10
? lockdep_unlock+0xf2/0x250
__bio_split_to_limits+0x1d8/0x900
blk_mq_submit_bio+0x1cf/0x18a0
? __pfx_iov_iter_extract_pages+0x10/0x10
? __pfx_blk_mq_submit_bio+0x10/0x10
? find_held_lock+0x2d/0x110
? lock_release+0x362/0x620
? mark_held_locks+0x9e/0xe0
__submit_bio+0x1ea/0x290
? __pfx___submit_bio+0x10/0x10
? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
submit_bio_noacct_nocheck+0x675/0xa20
? __pfx_bio_iov_iter_get_pages+0x10/0x10
? __pfx_submit_bio_noacct_nocheck+0x10/0x10
iomap_dio_bio_iter+0x624/0x1280
__iomap_dio_rw+0xa22/0x18a0
? lock_is_held_type+0xe3/0x140
? __pfx___iomap_dio_rw+0x10/0x10
? lock_release+0x362/0x620
? zonefs_file_write_iter+0x74c/0xc80 [zonefs]
? down_write+0x13d/0x1e0
iomap_dio_rw+0xe/0x40
zonefs_file_write_iter+0x5ea/0xc80 [zonefs]
do_iter_readv_writev+0x18b/0x2c0
? __pfx_do_iter_readv_writev+0x10/0x10
? inode_security+0x54/0xf0
do_iter_write+0x13b/0x7c0
? lock_is_held_type+0xe3/0x140
vfs_writev+0x185/0x550
? __pfx_vfs_writev+0x10/0x10
? __handle_mm_fault+0x9bd/0x1c90
? find_held_lock+0x2d/0x110
? lock_release+0x362/0x620
? find_held_lock+0x2d/0x110
? lock_release+0x362/0x620
? __up_read+0x1ea/0x720
? do_pwritev+0x136/0x1f0
do_pwritev+0x136/0x1f0
? __pfx_do_pwritev+0x10/0x10
? syscall_enter_from_user_mode+0x22/0x90
? lockdep_hardirqs_on+0x7d/0x100
do_syscall_64+0x58/0x80

This error depends on the hardware used, specifically on the max zone
append bytes and max_[hw_]sectors limits. Tests using AMD Epyc machines
that have low limits did not reveal this issue while runs on Intel Xeon
machines with larger limits trigger it.

Manually splitting the zone append BIO using bio_split_rw() can solve
this issue but also requires issuing the fragment BIOs synchronously
with submit_bio_wait(), to avoid potential reordering of the zone append
BIO fragments, which would lead to data corruption. That is, this
solution is not better than using regular write BIOs which are subject
to serialization using zone write locking at the IO scheduler level.

Given this, fix the issue by removing zone append support and using
regular write BIOs for synchronous direct writes. This allows preseving
the use of iomap and having identical synchronous and asynchronous
sequential file write path. Zone append support will be reintroduced
later through io_uring commands to ensure that the needed special
handling is done correctly.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 16d7fd3cfa72 ("zonefs: use iomap for synchronous direct writes")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 16d7fd3c 01-Jun-2023 Damien Le Moal <dlemoal@kernel.org>

zonefs: use iomap for synchronous direct writes

Remove the function zonefs_file_dio_append() that is used to manually
issue REQ_OP_ZONE_APPEND BIOs for processing synchronous direct writes
and use iomap instead.

To preserve the use of zone append operations for synchronous writes,
different struct iomap_dio_ops are defined. For synchronous direct
writes using zone append, zonefs_zone_append_dio_ops is introduced.
The submit_bio operation of this structure is defined as the function
zonefs_file_zone_append_dio_submit_io() which is used to change the BIO
opreation for synchronous direct IO writes to REQ_OP_ZONE_APPEND.

In order to preserve the write location check on completion of zone
append BIOs, the end_io operation is also defined using the function
zonefs_file_zone_append_dio_bio_end_io(). This check now relies on the
zonefs_zone_append_bio structure, allocated together with zone append
BIOs with a dedicated BIO set. This structure include the target inode
of a zone append BIO as well as the target append offset location for
the zone append operation. This is used to perform a check against
bio->bi_iter.bi_sector when the BIO completes, without needing to use
the zone information z_wpoffset field, thus removing the need for
taking the inode truncate mutex.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>


# 0fa5b08c 31-May-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: use __bio_add_page for adding single page to bio

The zonefs superblock reading code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is
never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/04c9978ccaa0fc9871cd4248356638d98daccf0c.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 43592c46 04-Jan-2023 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Cache zone group directory inodes

Since looking up any zone file inode requires looking up first the inode
for the directory representing the zone group of the file, ensuring that
the zone group inodes are always cached is desired. To do so, take an
extra reference on the zone groups directory inodes on mount, thus
avoiding the eviction of these inodes from the inode cache until the
volume is unmounted.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# d207794a 29-Nov-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Dynamically create file inodes when needed

Allocating and initializing all inodes and dentries for all files
results in a very large memory usage with high capacity zoned block
devices. For instance, with a 26 TB SMR HDD with over 96000 zones,
mounting the disk with zonefs results in about 130 MB of memory used,
the vast majority of this space being used for vfs inodes and dentries.

However, since a user will rarely access all zones at the same time,
dynamically creating file inodes and dentries on demand, similarly to
regular file systems, can significantly reduce memory usage.

This patch modifies mount processing to not create the inodes and
dentries for zone files. Instead, the directory inode operation
zonefs_lookup() and directory file operation zonefs_readdir() are
introduced to allocate and initialize inodes on-demand using the helper
functions zonefs_get_dir_inode() and zonefs_get_zgroup_inode().

Implementation of these functions is simple, relying on the static
nature of zonefs directories and files. Directory inodes are linked to
the volume zone groups (struct zonefs_zone_group) they represent by
using the directory inode i_private field. This simplifies the
implementation of the lookup and readdir operations.

Unreferenced zone file inodes can be evicted from the inode cache at any
time. In such case, the only inode information that cannot be recreated
from the zone information that is saved in the zone group data
structures attached to the volume super block is the inode uid, gid and
access rights. These values may have been changed by the user. To keep
these attributes for the life time of the mount, as before, the inode
mode, uid and gid are saved in the inode zone information and the saved
values are used to initialize regular file inodes when an inode lookup
happens. The zone information mode, uid and gid are initialized in
zonefs_init_zgroup() using the default values.

With these changes, the static minimal memory usage of a zonefs volume
is mostly reduced to the array of zone information for each zone group.
For the 26 TB SMR hard-disk mentioned above, the memory usage after
mount becomes about 5.4 MB, a reduction by a factor of 24 from the
initial 130 MB memory use.

Co-developed-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# aa7f243f 16-Nov-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Separate zone information from inode information

In preparation for adding dynamic inode allocation, separate an inode
zone information from the zonefs inode structure. The new data structure
zonefs_zone is introduced to store in memory information about a zone
that must be kept throughout the lifetime of the device mount.

Linking between a zone file inode and its zone information is done by
setting the inode i_private field to point to a struct zonefs_zone.
Using the i_private pointer avoids the need for adding a pointer in
struct zonefs_inode_info. Beside the vfs inode, this structure is
reduced to a mutex and a write open counter.

One struct zonefs_zone is created per file inode on mount. These
structures are organized in an array using the new struct
zonefs_zone_group data structure to represent zone groups. The
zonefs_zone arrays are indexed per file number (the index of a struct
zonefs_zone in its array directly gives the file number/name for that
zone file inode).

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# 34422914 24-Nov-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Reduce struct zonefs_inode_info size

Instead of using the i_ztype field in struct zonefs_inode_info to
indicate the zone type of an inode, introduce the new inode flag
ZONEFS_ZONE_CNV to be set in the i_flags field of struct
zonefs_inode_info to identify conventional zones. If this flag is not
set, the zone of an inode is considered to be a sequential zone.

The helpers zonefs_zone_is_cnv(), zonefs_zone_is_seq(),
zonefs_inode_is_cnv() and zonefs_inode_is_seq() are introduced to
simplify testing the zone type of a struct zonefs_inode_info and of a
struct inode.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# 46a9c526 24-Nov-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Simplify IO error handling

Simplify zonefs_check_zone_condition() by moving the code that changes
an inode access rights to the new function zonefs_inode_update_mode().
Furthermore, since on mount an inode wpoffset is always zero when
zonefs_check_zone_condition() is called during an inode initialization,
the "mount" boolean argument is not necessary for the readonly zone
case. This argument is thus removed.

zonefs_io_error_cb() is also modified to use the inode offline and
zone state flags instead of checking the device zone condition. The
multiple calls to zonefs_check_zone_condition() are reduced to the first
call on entry, which allows removing the "warn" argument.
zonefs_inode_update_mode() is also used to update an inode access rights
as zonefs_io_error_cb() modifies the inode flags depending on the volume
error handling mode (defined with a mount option). Since an inode mode
change differs for read-only zones between mount time and IO error time,
the flag ZONEFS_ZONE_INIT_MODE is used to differentiate both cases.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# 4008e2a0 24-Nov-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Reorganize code

Move all code related to zone file operations from super.c to the new
file.c file. Inode and zone management code remains in super.c.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# f861646a 12-Jan-2023 Christian Brauner <brauner@kernel.org>

quota: port to mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# f2d40141 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port inode_init_owner() to mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c1632a0f 12-Jan-2023 Christian Brauner <brauner@kernel.org>

fs: port ->setattr() to pass mnt_idmap

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# a608da3b 06-Jan-2023 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Detect append writes at invalid locations

Using REQ_OP_ZONE_APPEND operations for synchronous writes to sequential
files succeeds regardless of the zone write pointer position, as long as
the target zone is not full. This means that if an external (buggy)
application writes to the zone of a sequential file underneath the file
system, subsequent file write() operation will succeed but the file size
will not be correct and the file will contain invalid data written by
another application.

Modify zonefs_file_dio_append() to check the written sector of an append
write (returned in bio->bi_iter.bi_sector) and return -EIO if there is a
mismatch with the file zone wp offset field. This change triggers a call
to zonefs_io_error() and a zone check. Modify zonefs_io_error_cb() to
not expose the unexpected data after the current inode size when the
errors=remount-ro mode is used. Other error modes are correctly handled
already.

Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# db58653c 21-Nov-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Fix active zone accounting

If a file zone transitions to the offline or readonly state from an
active state, we must clear the zone active flag and decrement the
active seq file counter. Do so in zonefs_account_active() using the new
zonefs inode flags ZONEFS_ZONE_OFFLINE and ZONEFS_ZONE_READONLY. These
flags are set if necessary in zonefs_check_zone_condition() based on the
result of report zones operation after an IO error.

Fixes: 87c9ce3ffec9 ("zonefs: Add active seq file accounting")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# 4e458869 20-Nov-2022 Zhang Xiaoxu <zhangxiaoxu5@huawei.com>

zonefs: Fix race between modprobe and mount

There is a race between modprobe and mount as below:

modprobe zonefs | mount -t zonefs
--------------------------------|-------------------------
zonefs_init |
register_filesystem [1] |
| zonefs_fill_super [2]
zonefs_sysfs_init [3] |

1. register zonefs suceess, then
2. user can mount the zonefs
3. if sysfs initialize failed, the module initialize failed.

Then the mount process maybe some error happened since the module
initialize failed.

Let's register zonefs after all dependency resource ready. And
reorder the dependency resource release in module exit.

Fixes: 9277a6d4fbd4 ("zonefs: Export open zone resource information through sysfs")
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>


# 7dd12d65 24-Oct-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: fix zone report size in __zonefs_io_error()

When an IO error occurs, the function __zonefs_io_error() is used to
issue a zone report to obtain the latest zone information from the
device. This function gets a zone report for all zones used as storage
for a file, which is always 1 zone except for files representing
aggregated conventional zones.

The number of zones of a zone report for a file is calculated in
__zonefs_io_error() by doing a bit-shift of the inode i_zone_size field,
which is equal to or larger than the device zone size. However, this
calculation does not take into account that the last zone of a zoned
device may be smaller than the zone size reported by bdev_zone_sectors()
(which is used to set the bit shift size). As a result, if an error
occurs for an IO targetting such last smaller zone, the zone report will
ask for 0 zones, leading to an invalid zone report.

Fix this by using the fact that all files require a 1 zone report,
except if the inode i_zone_size field indicates a zone size larger than
the device zone size. This exception case corresponds to a mount with
aggregated conventional zones.

A check for this exception is added to the file inode initialization
during mount. If an invalid setup is detected, emit an error and fail
the mount (check contributed by Johannes Thumshirn).

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>


# 7b86e8a5 22-Jul-2022 Christoph Hellwig <hch@lst.de>

zonefs: remove ->writepage

->writepage is only used for single page writeback from memory reclaim,
and not called at all for cgroup writeback. Follow the lead of XFS
and remove ->writepage and rely entirely on ->writepages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 2ec810d5 05-Jun-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

mm/migrate: Add filemap_migrate_folio()

There is nothing iomap-specific about iomap_migratepage(), and it fits
a pattern used by several other filesystems, so move it to mm/migrate.c,
convert it to be filemap_migrate_folio() and convert the iomap filesystems
to use it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 6bac30bb 05-Jul-2022 Fabio M. De Francesco <fmdefrancesco@gmail.com>

zonefs: Call page_address() on page acquired with GFP_KERNEL flag

zonefs_read_super() acquires a page with alloc_page(GFP_KERNEL). That
page cannot come from ZONE_HIGHMEM, thus there's no need to map it with
kmap().

Therefore, use a plain page_address() on that page.

Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>


# 91b94c5d 22-May-2022 Al Viro <viro@zeniv.linux.org.uk>

iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC

New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb). Checks converted, which allows to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
from iocb_flags() - it's checked in iocb_is_dsync() instead

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# ff07a02e 14-Jul-2022 Bart Van Assche <bvanassche@acm.org>

treewide: Rename enum req_opf into enum req_op

The type name enum req_opf is misleading since it suggests that values of
this type include both an operation type and flags. Since values of this
type represent an operation only, change the type name into enum req_op.

Convert the enum req_op documentation into kernel-doc format. Move a few
definitions such that the enum req_op documentation occurs just above
the enum req_op definition.

The name "req_opf" was introduced by commit ef295ecf090d ("block: better op
and flags encoding").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b623e347 06-Jul-2022 Christoph Hellwig <hch@lst.de>

block: replace blkdev_nr_zones with bdev_nr_zones

Pass a block_device instead of a request_queue as that is what most
callers have at hand.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b27c82e1 21-Jun-2022 Christian Brauner <brauner@kernel.org>

attr: port attribute changes to new types

Now that we introduced new infrastructure to increase the type safety
for filesystems supporting idmapped mounts port the first part of the
vfs over to them.

This ports the attribute changes codepaths to rely on the new better
helpers using a dedicated type.

Before this change we used to take a shortcut and place the actual
values that would be written to inode->i_{g,u}id into struct iattr. This
had the advantage that we moved idmappings mostly out of the picture
early on but it made reasoning about changes more difficult than it
should be.

The filesystem was never explicitly told that it dealt with an idmapped
mount. The transition to the value that needed to be stored in
inode->i_{g,u}id appeared way too early and increased the probability of
bugs in various codepaths.

We know place the same value in struct iattr no matter if this is an
idmapped mount or not. The vfs will only deal with type safe
vfs{g,u}id_t. This makes it massively safer to perform permission checks
as the type will tell us what checks we need to perform and what helpers
we need to use.

Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
inode->i_{g,u}id since they are different types. Instead they need to
use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
vfs{g,u}id into the filesystem.

The other nice effect is that filesystems like overlayfs don't need to
care about idmappings explicitly anymore and can simply set up struct
iattr accordingly directly.

Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# 71e7b535 21-Jun-2022 Christian Brauner <brauner@kernel.org>

quota: port quota helpers mount ids

Port the is_quota_modification() and dqout_transfer() helper to type
safe vfs{g,u}id_t. Since these helpers are only called by a few
filesystems don't introduce a new helper but simply extend the existing
helpers to pass down the mount's idmapping.

Note, that this is a non-functional change, i.e. nothing will have
happened here or at the end of this series to how quota are done! This
a change necessary because we will at the end of this series make
ownership changes easier to reason about by keeping the original value
in struct iattr for both non-idmapped and idmapped mounts.

For now we always pass the initial idmapping which makes the idmapping
functions these helpers call nops.

This is done because we currently always pass the actual value to be
written to i_{g,u}id via struct iattr. While this allowed us to treat
the {g,u}id values in struct iattr as values that can be directly
written to inode->i_{g,u}id it also increases the potential for
confusion for filesystems.

Now that we are have dedicated types to prevent this confusion we will
ultimately only map the value from the idmapped mount into a filesystem
value that can be written to inode->i_{g,u}id when the filesystem
actually updates the inode. So pass down the initial idmapping until we
finished that conversion at which point we pass down the mount's
idmapping.

Since struct iattr uses an anonymous union with overlapping types as
supported by the C standard, filesystems that haven't converted to
ia_vfs{g,u}id won't see any difference and things will continue to work
as before. In other words, no functional changes intended with this
change.

Link: https://lore.kernel.org/r/20220621141454.2914719-7-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>


# c1c1204c 23-May-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: fix zonefs_iomap_begin() for reads

If a readahead is issued to a sequential zone file with an offset
exactly equal to the current file size, the iomap type is set to
IOMAP_UNWRITTEN, which will prevent an IO, but the iomap length is
calculated as 0. This causes a WARN_ON() in iomap_iter():

[17309.548939] WARNING: CPU: 3 PID: 2137 at fs/iomap/iter.c:34 iomap_iter+0x9cf/0xe80
[...]
[17309.650907] RIP: 0010:iomap_iter+0x9cf/0xe80
[...]
[17309.754560] Call Trace:
[17309.757078] <TASK>
[17309.759240] ? lock_is_held_type+0xd8/0x130
[17309.763531] iomap_readahead+0x1a8/0x870
[17309.767550] ? iomap_read_folio+0x4c0/0x4c0
[17309.771817] ? lockdep_hardirqs_on_prepare+0x400/0x400
[17309.778848] ? lock_release+0x370/0x750
[17309.784462] ? folio_add_lru+0x217/0x3f0
[17309.790220] ? reacquire_held_locks+0x4e0/0x4e0
[17309.796543] read_pages+0x17d/0xb60
[17309.801854] ? folio_add_lru+0x238/0x3f0
[17309.807573] ? readahead_expand+0x5f0/0x5f0
[17309.813554] ? policy_node+0xb5/0x140
[17309.819018] page_cache_ra_unbounded+0x27d/0x450
[17309.825439] filemap_get_pages+0x500/0x1450
[17309.831444] ? filemap_add_folio+0x140/0x140
[17309.837519] ? lock_is_held_type+0xd8/0x130
[17309.843509] filemap_read+0x28c/0x9f0
[17309.848953] ? zonefs_file_read_iter+0x1ea/0x4d0 [zonefs]
[17309.856162] ? trace_contention_end+0xd6/0x130
[17309.862416] ? __mutex_lock+0x221/0x1480
[17309.868151] ? zonefs_file_read_iter+0x166/0x4d0 [zonefs]
[17309.875364] ? filemap_get_pages+0x1450/0x1450
[17309.881647] ? __mutex_unlock_slowpath+0x15e/0x620
[17309.888248] ? wait_for_completion_io_timeout+0x20/0x20
[17309.895231] ? lock_is_held_type+0xd8/0x130
[17309.901115] ? lock_is_held_type+0xd8/0x130
[17309.906934] zonefs_file_read_iter+0x356/0x4d0 [zonefs]
[17309.913750] new_sync_read+0x2d8/0x520
[17309.919035] ? __x64_sys_lseek+0x1d0/0x1d0

Furthermore, this causes iomap_readahead() to loop forever as
iomap_readahead_iter() always returns 0, making no progress.

Fix this by treating reads after the file size as access to holes,
setting the iomap type to IOMAP_HOLE, the iomap addr to IOMAP_NULL_ADDR
and using the length argument as is for the iomap length. To simplify
the code with this change, zonefs_iomap_begin() is split into the read
variant, zonefs_read_iomap_begin() and zonefs_read_iomap_ops, and the
write variant, zonefs_write_iomap_begin() and zonefs_write_iomap_ops.

Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>


# 96eca145 02-Jun-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Do not ignore explicit_open with active zone limit

A zoned device may have no limit on the number of open zones but may
have a limit on the number of active zones it can support. In such
case, the explicit_open mount option should not be ignored to ensure
that the open() system call activates the zone with an explicit zone
open command, thus guaranteeing that the zone can be written.

Enforce this by ignoring the explicit_open mount option only for
devices that have both the open and active zone limits equal to 0.

Fixes: 87c9ce3ffec9 ("zonefs: Add active seq file accounting")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# a2a513be 02-Jun-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: fix handling of explicit_open option on mount

Ignoring the explicit_open mount option on mount for devices that do not
have a limit on the number of open zones must be done after the mount
options are parsed and set in s_mount_opts. Move the check to ignore
the explicit_open option after the call to zonefs_parse_options() in
zonefs_fill_super().

Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# 14bdb047 23-May-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Fix zonefs_init_file_inode() return value

Commit 87c9ce3ffec9 ("zonefs: Add active seq file accounting") wrongly
changed zonefs_init_file_inode() to always return 0 even if the call to
zonefs_zone_mgmt() fails. Fix this by propagating zonefs_zone_mgmt()
return value as the return value for zonefs_init_file_inode().

Fixes: 87c9ce3ffec9 ("zonefs: Add active seq file accounting")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>


# 786f847f 05-May-2022 Christoph Hellwig <hch@lst.de>

iomap: add per-iomap_iter private data

Allow the file system to keep state for all iterations. For now only
wire it up for direct I/O as there is an immediate need for it there.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


# 8597447d 30-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

iomap: Convert to release_folio

Change all the filesystems which used iomap_releasepage to use the
new function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>


# 7479c505 29-Apr-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Convert iomap_readpage to iomap_read_folio

A straightforward conversion as iomap_readpage already worked in folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>


# 87c9ce3f 12-Apr-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Add active seq file accounting

Modify struct zonefs_sb_info to add the s_active_seq_files atomic to
count the number of seq files representing a zone that is partially
written or explicitly open, that is, to count sequential files with
a zone that is in an active state on the device.

The helper function zonefs_account_active() is introduced to update
this counter whenever a file is written or truncated. This helper is
also used in the zonefs_seq_file_write_open() and
zonefs_seq_file_write_close() functions when the explicit_open mount
option is used.

The s_active_seq_files counter is exported through sysfs using the
read-only attribute nr_active_seq_files. The device maximum number of
active zones is also exported through sysfs with the read-only attribute
max_active_seq_files.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>


# 9277a6d4 12-Apr-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Export open zone resource information through sysfs

To allow applications to easily check the current usage status of the
open zone resources of the mounted device, export through sysfs the
counter of write open sequential files s_wro_seq_files field of
struct zonefs_sb_info. The attribute is named nr_wro_seq_files and is
read only.

The maximum number of write open sequential files (zones) indicated by
the s_max_wro_seq_files field of struct zonefs_sb_info is also exported
as the read only attribute max_wro_seq_files.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>


# 7d6dfbe0 12-Apr-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Always do seq file write open accounting

The explicit_open mount option forces an explicitly open of the zone of
sequential files that are open for writing to ensure that the open file
can be written without the device failing write operations due to open
zone resources limit being exceeded. To implement this, zonefs accounts
all write open seq file when this mount option is used.

This accounting however can be easily performed even when the
explicit_open mount option is not used, thus allowing applications to
control zone resources on their own, without relying on open() system
call failures from zonefs.

To implement this, the helper zonefs_file_use_exp_open() is removed and
replaced with the helper zonefs_seq_file_need_wro() which test if a file
is a sequential file being open with write access. zonefs_open_zone()
and zonefs_close_zone() are renamed respectively to
zonefs_seq_file_write_open() and zonefs_seq_file_write_close() and
modified to update the s_wro_seq_files counter regardless of the
explicit_open mount option use.

If the explicit_open mount option is used, zonefs_seq_file_write_open()
execute an explicit zone open operation for a sequential file open for
writing for the first time, as before.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>


# 2b95a23c 12-Apr-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Rename super block information fields

The s_open_zones field of struct zonefs_sb_info is used to count the
number of files that are open for writing and may not necessarilly
correspond to the number of open zones on the device. For instance, an
application may open for writing a sequential zone file, fully write it
and keep the file open. In such case, the zone of the file is not open
anymore (it is in the full state).

Avoid confusion about this counter meaning by renaming it to
s_wro_seq_files. To keep things consistent, the field s_max_open_zones
is renamed to s_max_wro_seq_files.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>


# 19139539 12-Apr-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Fix management of open zones

The mount option "explicit_open" manages the device open zone
resources to ensure that if an application opens a sequential file for
writing, the file zone can always be written by explicitly opening
the zone and accounting for that state with the s_open_zones counter.

However, if some zones are already open when mounting, the device open
zone resource usage status will be larger than the initial s_open_zones
value of 0. Ensure that this inconsistency does not happen by closing
any sequential zone that is open when mounting.

Furthermore, with ZNS drives, closing an explicitly open zone that has
not been written will change the zone state to "closed", that is, the
zone will remain in an active state. Since this can then cause failures
of explicit open operations on other zones if the drive active zone
resources are exceeded, we need to make sure that the zone is not
active anymore by resetting it instead of closing it. To address this,
zonefs_zone_mgmt() is modified to change a REQ_OP_ZONE_CLOSE request
into a REQ_OP_ZONE_RESET for sequential zones that have not been
written.

Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>


# b954ebba 12-Apr-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Clear inode information flags on inode creation

Ensure that the i_flags field of struct zonefs_inode_info is cleared to
0 when initializing a zone file inode, avoiding seeing the flag
ZONEFS_ZONE_OPEN being incorrectly set.

Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>


# 2aba0d19 14-Apr-2022 Christoph Hellwig <hch@lst.de>

block: add a bdev_max_zone_append_sectors helper

Add a helper to check the max supported sectors for zone append based on
the block_device instead of having to poke into the block layer internal
request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1da18a29 12-Apr-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Fix management of open zones

The mount option "explicit_open" manages the device open zone
resources to ensure that if an application opens a sequential file for
writing, the file zone can always be written by explicitly opening
the zone and accounting for that state with the s_open_zones counter.

However, if some zones are already open when mounting, the device open
zone resource usage status will be larger than the initial s_open_zones
value of 0. Ensure that this inconsistency does not happen by closing
any sequential zone that is open when mounting.

Furthermore, with ZNS drives, closing an explicitly open zone that has
not been written will change the zone state to "closed", that is, the
zone will remain in an active state. Since this can then cause failures
of explicit open operations on other zones if the drive active zone
resources are exceeded, we need to make sure that the zone is not
active anymore by resetting it instead of closing it. To address this,
zonefs_zone_mgmt() is modified to change a REQ_OP_ZONE_CLOSE request
into a REQ_OP_ZONE_RESET for sequential zones that have not been
written.

Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>


# 694852ea 12-Apr-2022 Damien Le Moal <damien.lemoal@opensource.wdc.com>

zonefs: Clear inode information flags on inode creation

Ensure that the i_flags field of struct zonefs_inode_info is cleared to
0 when initializing a zone file inode, avoiding seeing the flag
ZONEFS_ZONE_OPEN being incorrectly set.

Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>


# c75e707f 04-Mar-2022 Christoph Hellwig <hch@lst.de>

block: remove the per-bio/request write hint

With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# fd60b288 22-Mar-2022 Muchun Song <songmuchun@bytedance.com>

fs: allocate inode by using alloc_inode_sb()

The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().

Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 187c82cb 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio

These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs


# d82354f6 09-Feb-2022 Matthew Wilcox (Oracle) <willy@infradead.org>

iomap: Remove iomap_invalidatepage()

Use iomap_invalidate_folio() in all the iomap-based filesystems
and rename the iomap_invalidatepage tracepoint.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs


# 49add496 24-Jan-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device and opf to bio_init

Pass the block_device that we plan to use this bio for and the
operation to bio_init to optimize the assignment. A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 07888c66 24-Jan-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device and opf to bio_alloc

Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ffea259 16-Dec-2021 Naohiro Aota <naohiro.aota@wdc.com>

zonefs: add MODULE_ALIAS_FS

Add MODULE_ALIAS_FS() to load the module automatically when you do "mount
-t zonefs".

Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: stable <stable@vger.kernel.org> # 5.6+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <jth@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>


# 4fdccaa0 23-Jul-2021 Andreas Gruenbacher <agruenba@redhat.com>

iomap: Add done_before argument to iomap_dio_rw

Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred. When the request succeeds, we
report that done_before additional bytes were tranferred. This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted. The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>


# 3e08773c 12-Oct-2021 Christoph Hellwig <hch@lst.de>

block: switch polling to be bio based

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 448f9490 13-Apr-2021 Jan Kara <jack@suse.cz>

zonefs: Convert to using invalidate_lock

Use invalidate_lock instead of zonefs' private i_mmap_sem. The intended
purpose is exactly the same.

CC: Damien Le Moal <damien.lemoal@wdc.com>
CC: Johannes Thumshirn <jth@kernel.org>
CC: <linux-fsdevel@vger.kernel.org>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>


# 2f53d15c 30-Jun-2021 Xianting Tian <xianting.tian@linux.alibaba.com>

zonefs: remove redundant null bio check

bio_alloc() with __GFP_DIRECT_RECLAIM, which is included in
GFP_NOFS, never fails, see comments in bio_alloc_bioset().

Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 3a6b2162 28-Jun-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: move page dirtying prototypes from mm.h

These functions implement the address_space ->set_page_dirty operation and
should live in pagemap.h, not mm.h so that the rest of the kernel doesn't
get funny ideas about calling them directly.

Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fd7353f8 28-Jun-2021 Matthew Wilcox (Oracle) <willy@infradead.org>

iomap: use __set_page_dirty_nobuffers

The only difference between iomap_set_page_dirty() and
__set_page_dirty_nobuffers() is that the latter includes a debugging check
that a !Uptodate page has private data.

Link: https://lkml.kernel.org/r/20210615162342.1669332-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9591c3a3 22-Mar-2021 Amir Goldstein <amir73il@gmail.com>

fs: introduce a wrapper uuid_to_fsid()

Some filesystem's use a digest of their uuid for f_fsid.
Create a simple wrapper for this open coded folding.

Filesystems that have a non null uuid but use the block device
number for f_fsid may also consider using this helper.

[JK: Added missing asm/byteorder.h include]
Link: https://lore.kernel.org/r/20210322173944.449469-2-amir73il@gmail.com
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>


# 6980d29c 16-Mar-2021 Chao Yu <chao@kernel.org>

zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()

In zonefs_open_zone(), if opened zone count is larger than
.s_max_open_zones threshold, we missed to recover .i_wr_refcnt,
fix this.

Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# ebfd68cd 09-Mar-2021 Damien Le Moal <damien.lemoal@wdc.com>

zonefs: Fix O_APPEND async write handling

zonefs updates the size of a sequential zone file inode only on
completion of direct writes. When executing asynchronous append writes
(with a file open with O_APPEND or using RWF_APPEND), the use of the
current inode size in generic_write_checks() to set an iocb offset thus
leads to unaligned write if an application issues an append write
operation with another write already being executed.

Fix this problem by introducing zonefs_write_checks() as a modified
version of generic_write_checks() using the file inode wp_offset for an
append write iocb offset. Also introduce zonefs_write_check_limits() to
replace generic_write_check_limits() call. This zonefs special helper
makes sure that the maximum file limit used is the maximum size of the
file being accessed.

Since zonefs_write_checks() already truncates the iov_iter, the calls
to iov_iter_truncate() in zonefs_file_dio_write() and
zonefs_file_buffered_write() are removed.

Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: <stable@vger.kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 1601ea06 14-Mar-2021 Damien Le Moal <damien.lemoal@wdc.com>

zonefs: prevent use of seq files as swap file

The sequential write constraint of sequential zone file prevent their
use as swap files. Only allow conventional zone files to be used as swap
files.

Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: <stable@vger.kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# a8affc03 10-Mar-2021 Christoph Hellwig <hch@lst.de>

block: rename BIO_MAX_PAGES to BIO_MAX_VECS

Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 549c7297 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

fs: make helpers idmap mount aware

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 2f221d6f 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

attr: handle idmapped mounts

When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 21cb47be 21-Jan-2021 Christian Brauner <christian.brauner@ubuntu.com>

inode: make init and permission helpers idmapped mount aware

The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Allow it to
handle idmapped mounts. If the inode is accessed through an idmapped
mount it according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Similarly, allow the inode_init_owner() helper to handle idmapped
mounts. It initializes a new inode on idmapped mounts by mapping the
fsuid and fsgid of the caller from the mount's user namespace. If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>


# 059c0103 17-Feb-2021 Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

zonefs: Fix file size of zones in full condition

Per ZBC/ZAC/ZNS specifications, write pointers may not have valid values
when zones are in full condition. However, when zonefs mounts a zoned
block device, zonefs refers write pointers to set file size even when
the zones are in full condition. This results in wrong file size. To fix
this, refer maximum file size in place of write pointers for zones in
full condition.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: <stable@vger.kernel.org> # 5.6+
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 62ab1aad 26-Jan-2021 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: add tracepoints for file operations

Add tracepoints for file I/O operations to aid in debugging of I/O errors
with zonefs.

The added tracepoints are in:
- zonefs_zone_mgmt() for tracing zone management operations
- zonefs_iomap_begin() for tracing regular file I/O
- zonefs_file_dio_append() for tracing zone-append operations

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 0f1ba5f5 27-Jan-2021 Damien Le Moal <damien.lemoal@wdc.com>

zonefs: use zone write granularity as block size

Zoned block devices have different granularity constraints for write
operations into sequential zones. E.g. ZBC and ZAC devices require that
writes be aligned to the device physical block size while NVMe ZNS
devices allow logical block size aligned write operations. To correctly
handle such difference, use the device zone write granularity limit to
set the block size of a zonefs volume, thus allowing the smallest
possible write unit for all zoned device types.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c6bf3f0e 26-Jan-2021 Christoph Hellwig <hch@lst.de>

block: use an on-stack bio in blkdev_issue_flush

There is no point in allocating memory for a synchronous flush.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# f91ca2a3 26-Jan-2021 Christoph Hellwig <hch@lst.de>

zonefs: use bio_alloc in zonefs_file_dio_append

Use bio_alloc instead of open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 2f632965 23-Jan-2021 Christoph Hellwig <hch@lst.de>

iomap: pass a flags argument to iomap_dio_rw

Pass a set of flags to iomap_dio_rw instead of the boolean
wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag
replaces the wait_for_completion, but only needs to be passed
when the iocb isn't synchronous to start with to simplify the
callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[djwong: rework xfs_file.c so that we can push iomap changes separately]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>


# 6bea0225 09-Dec-2020 Damien Le Moal <damien.lemoal@wdc.com>

zonefs: fix page reference and BIO leak

In zonefs_file_dio_append(), the pages obtained using
bio_iov_iter_get_pages() are not released on completion of the
REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails.
Furthermore, a call to bio_put() is missing when
bio_iov_iter_get_pages() fails.

Fix these resource leaks by adding BIO resource release code (bio_put()i
and bio_release_pages()) at the end of the function after the BIO
execution and add a jump to this resource cleanup code in case of
bio_iov_iter_get_pages() failure.

While at it, also fix the call to task_io_account_write() to be passed
the correct BIO size instead of bio_iov_iter_get_pages() return value.

Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>


# 6d1349c7 18-Sep-2020 Al Viro <viro@zeniv.linux.org.uk>

[PATCH] reduce boilerplate in fsid handling

Get rid of boilerplate in most of ->statfs()
instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# b5c00e97 11-Sep-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: open/close zone on file open/close

NVMe Zoned Namespace introduced the concept of active zones, which are
zones in the implicit open, explicit open or closed condition. Drives may
have a limit on the number of zones that can be simultaneously active.
This potential limitation translate into a risk for applications to see
write IO errors due to this limit if the zone of a file being written to is
not already active when a write request is issued.

To avoid these potential errors, the zone of a file can explicitly be made
active using an open zone command when the file is open for the first
time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. This indirect management
of active zones relies on the maximum number of open zones of a drive,
which is always lower or equal to the maximum number of active zones.

On the first open of a sequential zone file, send a REQ_OP_ZONE_OPEN
command to the block device. Conversely, on the last release of a zone
file and send a REQ_OP_ZONE_CLOSE to the device if the zone is not full or
empty.

As truncating a zone file to 0 or max can deactivate a zone as well, we
need to serialize against truncates and also be careful not to close a
zone as the file may still be open for writing, e.g. the user called
ftruncate(). If the zone file is not open and a process does a truncate(),
then no close operation is needed.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 48d546a8 11-Sep-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: provide no-lock zonefs_io_error variant

Subsequent patches need to call zonefs_io_error() with the i_truncate_mutex
already held, so factor out the body of zonefs_io_error() into
__zonefs_io_error() which can be called from with the i_truncate_mutex
held.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 5498d5f9 11-Sep-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: introduce helper for zone management

Introduce a helper function for sending zone management commands to the
block device.

As zone management commands can change a zone write pointer position
reflected in the size of the zone file, this function expects the truncate
mutex to be held.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# e3c3155b 21-Jul-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: add zone-capacity support

In the zoned storage model, the sectors within a zone are typically all
writeable. With the introduction of the Zoned Namespace (ZNS) Command
Set in the NVM Express organization, the model was extended to have a
specific writeable capacity.

This zone capacity can be less than the overall zone size for a NVMe ZNS
device or null_blk in zoned-mode. For other ZBC/ZAC devices the zone
capacity is always equal to the zone size.

Use the zone capacity field instead from blk_zone for determining the
maximum inode size and inode blocks in zonefs.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 60263d58 23-Jul-2020 Christoph Hellwig <hch@lst.de>

iomap: fall back to buffered writes for invalidation failures

Failing to invalid the page cache means data in incoherent, which is
a very bad state for the system. Always fall back to buffered I/O
through the page cache if we can't invalidate mappings.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> # for gfs2
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>


# 89ee7237 16-Jul-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: count pages after truncating the iterator

Count pages after possibly truncating the iterator to the maximum zone
append size, not before.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 01b2651c 20-Jul-2020 Damien Le Moal <damien.lemoal@wdc.com>

zonefs: Fix compilation warning

Avoid the compilation warning "Variable 'ret' is reassigned a value
before the old one has been used." in zonefs_create_zgroup() by setting
ret for the error path only if an error happens.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 9d24a13a 01-Jun-2020 Matthew Wilcox (Oracle) <willy@infradead.org>

iomap: convert from readpages to readahead

Use the new readahead operation in iomap. Convert XFS and ZoneFS to use
it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-26-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9398554f 13-May-2020 Christoph Hellwig <hch@lst.de>

block: remove the error_sector argument to blkdev_issue_flush

The argument isn't used by any caller, and drivers don't fill out
bi_sector for flush requests either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 02ef12a6 12-May-2020 Johannes Thumshirn <johannes.thumshirn@wdc.com>

zonefs: use REQ_OP_ZONE_APPEND for sync DIO

Synchronous direct I/O to a sequential write only zone can be issued using
the new REQ_OP_ZONE_APPEND request operation. As dispatching multiple
BIOs can potentially result in reordering, we cannot support asynchronous
IO via this interface.

We also can only dispatch up to queue_max_zone_append_sectors() via the
new zone-append method and have to return a short write back to user-space
in case an IO larger than queue_max_zone_append_sectors() has been issued.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 568776f9 23-Apr-2020 Andy Shevchenko <andriy.shevchenko@linux.intel.com>

zonefs: Replace uuid_copy() with import_uuid()

There is a specific API to treat raw data as UUID, i.e. import_uuid().
Use it instead of uuid_copy() with explicit casting.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# ccf4ad7d 20-Mar-2020 Damien Le Moal <damien.lemoal@wdc.com>

zonfs: Fix handling of read-only zones

The write pointer of zones in the read-only consition is defined as
invalid by the SCSI ZBC and ATA ZAC specifications. It is thus not
possible to determine the correct size of a read-only zone file on
mount. Fix this by handling read-only zones in the same manner as
offline zones by disabling all accesses to the zone (read and write)
and initializing the inode size of the read-only zone to 0).

For zones found to be in the read-only condition at runtime, only
disable write access to the zone and keep the size of the zone file to
its last updated value to allow the user to recover previously written
data.

Also fix zonefs documentation file to reflect this change.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


# 7c69eb84 21-Feb-2020 Christoph Hellwig <hch@lst.de>

zonefs: fix IOCB_NOWAIT handling

IOCB_NOWAIT can't just be ignored as it breaks applications expecting
it not to block. Just refuse the operation as applications must handle
that (e.g. by falling back to a thread pool).

Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>


# 8dcc1a9d 25-Dec-2019 Damien Le Moal <damien.lemoal@wdc.com>

fs: New zonefs file system

zonefs is a very simple file system exposing each zone of a zoned block
device as a file. Unlike a regular file system with zoned block device
support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices to the user. Files representing
sequential write zones of the device must be written sequentially
starting from the end of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access
interface than to a full featured POSIX file system. The goal of zonefs
is to simplify the implementation of zoned block device support in
applications by replacing raw block device file accesses with a richer
file API, avoiding relying on direct block device file ioctls which may
be more obscure to developers. One example of this approach is the
implementation of LSM (log-structured merge) tree structures (such as
used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
to be stored in a zone file similarly to a regular file system rather
than as a range of sectors of a zoned device. The introduction of the
higher level construct "one file is one zone" can help reducing the
amount of changes needed in the application as well as introducing
support for different application programming languages.

Zonefs on-disk metadata is reduced to an immutable super block to
persistently store a magic number and optional feature flags and
values. On mount, zonefs uses blkdev_report_zones() to obtain the device
zone configuration and populates the mount point with a static file tree
solely based on this information. E.g. file sizes come from the device
zone type and write pointer offset managed by the device itself.

The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together
under a common sub-directory:
* For conventional zones, the sub-directory "cnv" is used.
* For sequential write zones, the sub-directory "seq" is used.
These two directories are the only directories that exist in zonefs.
Users cannot create other directories and cannot rename nor delete
the "cnv" and "seq" sub-directories.
2) The name of zone files is the number of the file within the zone
type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size.
Conventional zone files cannot be truncated.
4) The size of sequential zone files represent the file's zone write
pointer position relative to the zone start sector. Truncating these
files is allowed only down to 0, in which case, the zone is reset to
rewind the zone write pointer position to the start of the zone, or
up to the zone size, in which case the file's zone is transitioned
to the FULL state (finish zone operation).
5) All read and write operations to files are not allowed beyond the
file zone size. Any access exceeding the zone size is failed with
the -EFBIG error.
6) Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations
that can be issued to conventional zone files. Buffered, direct and
mmap read & write operations are accepted. For sequential zone files,
there are no restrictions on read operations, but all write
operations must be direct IO append writes. mmap write of sequential
files is not allowed.

Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional
zones can be aggregated into a single larger file instead of the
default one file per zone.
* File ownership: The owner UID and GID of zone files is by default 0
(root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be
changed.

The mkzonefs tool is used to format zoned block devices for use with
zonefs. This tool is available on Github at:

git@github.com:damien-lemoal/zonefs-tools.git.

zonefs-tools also includes a test suite which can be run against any
zoned block device, including null_blk block device created with zoned
mode.

Example: the following formats a 15TB host-managed SMR HDD with 256 MB
zones with the conventional zones aggregation feature enabled.

$ sudo mkzonefs -o aggr_cnv /dev/sdX
$ sudo mount -t zonefs /dev/sdX /mnt
$ ls -l /mnt/
total 0
dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of the zone files sub-directories indicate the number of files
existing for each type of zones. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a
single file).

$ ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file.

$ sudo mkfs.ext4 /mnt/cnv/0
$ sudo mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory grouping files for sequential write zones has
in this example 55356 zones.

$ ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is
appended at the end of the file, similarly to any regular file system.

$ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s

$ ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any
further write operation.

$ truncate -s 268435456 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and
restart append-writes to the file.

$ truncate -s 0 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

Since files are statically mapped to zones on the disk, the number of
blocks of a file as reported by stat() and fstat() indicates the size
of the file zone.

$ stat /mnt/seq/0
File: /mnt/seq/0
Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
Device: 870h/2160d Inode: 50431 Links: 1
Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks
gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
to the device zone size in this example. Of note is that the "IO block"
field always indicates the minimum IO size for writes and corresponds
to the device physical sector size.

This code contains contributions from:
* Johannes Thumshirn <jthumshirn@suse.de>,
* Darrick J. Wong <darrick.wong@oracle.com>,
* Christoph Hellwig <hch@lst.de>,
* Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
* Ting Yao <tingyao@hust.edu.cn>.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>